http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

In [23]:
from __future__ import print_function
import tensorflow as tf
import numpy as np
from six.moves import cPickle as pickle
from six.moves import range

import pandas as pd
import string

In [5]:
column_names = ['Assessment', 'Docid', 'Title', 'Authors', 'Journal',
                'ISSN', 'Year', 'Language', 'Abstract', 'Keywords']

In [38]:
df_dev = pd.read_csv('phase1.dev.shuf.tsv', sep='\t', names=column_names)

In [40]:
df_train = pd.read_csv('phase1.train.shuf.tsv', sep='\t', names=column_names)
df_train.head()

| | Assessment | Docid | Title | Authors | Journal | ISSN | Year | Language | Abstract | Keywords |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1 | hash:3f1ebe70-a242-3b43-843c-eef89284607a | Misoprostol for treating postpartum haemorrhag... | Hofmeyr, G. J.;Ferreira, S.;Nikodem, V. C.;Man... | BMC Pregnancy and Childbirth | 1471-2393 | 2004 | eng | Background: Postpartum haemorrhage remains an ... | South Africa;adult;article;blood transfusion;c... |
| 1 | -1 | hash:aa35378f-0460-37f1-b001-ac735e027333 | Vitamin A supplements and diarrheal and respir... | Fawzi, W. W.;Mbise, R.;Spiegelman, D.;Fataki, ... | J Pediatr | 0022-3476 | 2000 | eng | OBJECTIVE: To determine the effect of vitamin ... | Child, Preschool;Diarrhea/ epidemiology;Dietar... |
| 2 | -1 | hash:3ddd7e14-a607-3313-a74f-613c988206f3 | The efficacy and safety of a controlled releas... | Gathua, S. N.;Aluoch, J. A. | East Afr Med J | | 1990 | eng | The treatment of asthma in Africa is influence... | Adult;Albuterol/administration & dosage/advers... |
| 3 | -1 | hash:41e91fb1-6cfe-3347-aa26-4424e0afe11e | The state of the art of education for child su... | Mrisho, F. H. | BERC Bull | | 1987 | eng | PIP: Tanzania has both a high infant mortality... | Africa;Africa South of the Sahara;Africa, East... |
| 4 | -1 | hash:92d8601c-ef4f-39b1-b65c-ce1c7325ba6f | [The practicability of preceptorship in the cu... | Lin, C. C.;Lo, K. M.;Leu, C. S. | Kaohsiung J Med Sci | | 1996 | eng | Physicians who have graduated from traditional... | Adult;Aged;Curriculum;Education, Medical;Engli... |


In [41]:
df_train.count()

Assessment    21662
Docid         21662
Title         21662
Authors       21250
Journal       20837
ISSN          13422
Year          20981
Language      21661
Abstract      21661
Keywords      20912
dtype: int64
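
The counts above show that `Title` is populated in all 21662 rows, while `Abstract` and `Language` are each missing one value and `ISSN` is missing many. This matters for text columns because `CountVectorizer` expects strings, so a column with missing values would typically need to be filled before vectorizing. A minimal sketch on a made-up frame (the toy data and the choice to fill with an empty string are assumptions, not part of the original notebook):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_train; the values are invented for illustration.
toy = pd.DataFrame({
    'Title': ['Vitamin A supplements', 'Asthma treatment', 'Child survival'],
    'Abstract': ['First abstract text', np.nan, 'Third abstract text'],
})

# count() reports the number of non-null values per column.
print(toy.count())

# Replace missing abstracts with an empty string so every entry is a str.
abstracts = toy['Abstract'].fillna('')
print(abstracts.tolist())
```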

## Extracting features from text files

#### Bags of words
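
A bag-of-words representation assigns each distinct word in the corpus an integer index and describes a document by how many times each word occurs in it, ignoring word order. A minimal pure-Python sketch of the idea (the two sample strings and the whitespace tokenizer are simplifying assumptions; `CountVectorizer` applies its own tokenization rules):

```python
from collections import Counter

docs = ["vitamin supplements and diarrhea",
        "vitamin supplements and vitamin trials"]

# Build the vocabulary: one integer index per distinct word, in sorted order.
all_words = sorted({w for d in docs for w in d.split()})
vocabulary = {word: i for i, word in enumerate(all_words)}

def bag_of_words(doc):
    """Return a count vector with one slot per vocabulary word."""
    counts = Counter(doc.split())
    return [counts.get(word, 0) for word in all_words]

vectors = [bag_of_words(d) for d in docs]
print(vocabulary)
print(vectors)
```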

#### Tokenizing text with `scikit-learn`

In [32]:
from sklearn.feature_extraction.text import CountVectorizer

Text preprocessing, tokenizing, and stop-word filtering are included in a high-level component that builds a dictionary of features and transforms documents into feature vectors:

In [42]:
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(df_train['Title'])

In [43]:
X_train_counts

<21662x17410 sparse matrix of type '<type 'numpy.int64'>'
	with 262030 stored elements in Compressed Sparse Row format>
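
To make the sparse matrix above less opaque, here is a small sketch on two made-up titles (the sample strings are assumptions). The `vocabulary_` attribute maps each token to its column index. With default settings `CountVectorizer` lowercases the text and drops single-character tokens, but it applies no stop-word list unless `stop_words` is set:

```python
from sklearn.feature_extraction.text import CountVectorizer

titles = ["Vitamin A supplements",
          "Vitamin supplements and diarrhea"]

vect = CountVectorizer()
counts = vect.fit_transform(titles)   # scipy sparse matrix, one row per title

# Token -> column index, learned during fit.
print(sorted(vect.vocabulary_.items(), key=lambda kv: kv[1]))

# A dense view is fine for a toy example; avoid it on a large matrix.
print(counts.toarray())
```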

#### From occurrences to frequencies

In [33]:
from sklearn.feature_extraction.text import TfidfTransformer

In [44]:
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
X_train_tf.shape

(21662, 17410)
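
With `use_idf=False` and the default `norm='l2'`, `TfidfTransformer` rescales each row of counts to unit Euclidean length, which removes the effect of document length. A small sketch checking this against a manual NumPy computation (the count matrix is made up for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

counts = np.array([[0, 0, 1, 1],
                   [2, 1, 1, 0]])

tf = TfidfTransformer(use_idf=False).fit_transform(csr_matrix(counts)).toarray()

# Manual equivalent: divide each row by its L2 norm.
manual = counts / np.linalg.norm(counts, axis=1, keepdims=True)

print(np.allclose(tf, manual))
```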

In [45]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(21662, 17410)
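
With the default `use_idf=True`, each column is additionally weighted by its inverse document frequency before normalization, so tokens that occur in many documents count for less. With the default `smooth_idf=True`, scikit-learn computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing token t. A sketch reproducing this by hand (the toy counts are an assumption):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

counts = np.array([[0, 0, 1, 1],
                   [2, 1, 1, 0],
                   [0, 1, 1, 0]])

transformer = TfidfTransformer()   # use_idf=True, smooth_idf=True, norm='l2'
tfidf = transformer.fit_transform(csr_matrix(counts)).toarray()

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)      # number of documents containing each token
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts by idf, then L2-normalize each row.
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

print(np.allclose(transformer.idf_, idf), np.allclose(tfidf, manual))
```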

## Training a classifier

In [36]:
from sklearn.naive_bayes import MultinomialNB

In [46]:
clf = MultinomialNB().fit(X_train_tfidf, df_train['Assessment'])

To predict the outcome for new documents, we need to extract features using almost the same feature-extraction chain as before. The difference is that we call `transform` instead of `fit_transform` on the transformers, since they have already been fit to the training set:

In [54]:
X_dev_counts = count_vect.transform(df_dev['Title'])
X_dev_tfidf = tfidf_transformer.transform(X_dev_counts)

predicted = clf.predict(X_dev_tfidf)
predicted

array([-1, -1, -1, ..., -1, -1, -1])
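
The fit-then-transform bookkeeping described above can also be delegated to a `Pipeline`, which calls `fit_transform` on each transformer during `fit` and only `transform` during `predict`. A minimal sketch on made-up titles and labels (the toy data and the name `text_clf` are illustrative, not from the original notebook):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

train_titles = ["malaria vaccine trial", "malaria treatment trial",
                "stock market report", "market price report"]
train_labels = [1, 1, -1, -1]

text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
text_clf.fit(train_titles, train_labels)

# Words not seen during fit ("new") are ignored: the vocabulary is fixed at fit time.
new_titles = ["new malaria trial", "market report"]
predicted = text_clf.predict(new_titles)
print(predicted)
```

On the notebook's data, the equivalent calls would be `text_clf.fit(df_train['Title'], df_train['Assessment'])` followed by `text_clf.predict(df_dev['Title'])`.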

In [53]:
# Work on a copy so that adding a column does not modify df_dev itself.
df_predicted = df_dev.copy()
df_predicted['Predicted Assessment'] = predicted
df_predicted.head()

| | Assessment | Docid | Title | Authors | Journal | ISSN | Year | Language | Abstract | Keywords | Predicted Assessment |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1 | hash:9ef84cbb-2f69-3bdc-872f-b986823fe4cd | Educational needs in patient care practices in... | Seto, W. H.;Ong, S. G.;Ching, T. Y.;Liu, S. H.... | Am J Infect Control | 0196-6553 | 1988.0 | eng | We conducted a survey on staff perceptions of ... | Cross Infection/ prevention & control;Disinfec... | -1 |
| 1 | -1 | hash:ee1f7198-f1cf-3ee9-a794-ad7d5c3165be | Methods, equipment and techniques for rural he... | Ramalingaswami, V. | Proc R Soc Lond B Biol Sci | 0080-4649 (Print) | 1980.0 | eng | It is obvious that any strategy for village he... | Community Health Aides/utilization;Delivery of... | -1 |
| 2 | -1 | hash:4a24e3aa-ebc1-3826-b98c-302815763ede | Limitations in verbal fluency following heavy ... | Patrick, P. D.;Oria, R. B.;Madhavan, V.;Pinker... | Child Neuropsychology | 0929-7049 | 2005.0 | eng | The effects of heavy burdens of diarrhea in th... | Brazil;academic achievement;article;breast fee... | -1 |
| 3 | -1 | hash:e57a42d3-db98-3b5b-b7ad-5b2efb73f76e | Attitude towards rape: a comparative study amo... | Sivagnanam, G.;Bairy, K. L.;D'Souza, U. | Med J Malaysia | | 2005.0 | eng | The global statistics reveal that at least one... | Adolescent;Adult;Attitude of Health Personnel;... | -1 |
| 4 | -1 | hash:e2a94bd0-10e9-3ca1-a12f-18705ef15b74 | An evaluation of a training workshop for pharm... | Sinclair, H.;Bond, C.;Lennox, S.;Silcock, J.;W... | Health Education Journal 1997 Sep; 56(3): 296-... | | | eng | This paper details part of the findings of a r... | Adult;Audiorecording;Change Theory;Chi Square ... | -1 |


In [57]:
df_predicted[df_predicted['Assessment'] == df_predicted['Predicted Assessment']].count()

Assessment              4700
Docid                   4700
Title                   4700
Authors                 4700
Journal                 4535
ISSN                    2900
Year                    4561
Language                4700
Abstract                4700
Keywords                4543
Predicted Assessment    4700
dtype: int64
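
The row count above gives the number of correct predictions but not the rate. Since the predictions displayed earlier are all `-1`, it is also worth checking per-class results, because a classifier that always predicts the majority class can still match many rows. A hedged sketch using `sklearn.metrics` on made-up labels (the arrays below are illustrative, not the notebook's results):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

y_true = np.array([-1, -1, -1, 1, 1])
y_pred = np.array([-1, -1, -1, -1, -1])   # a model that always predicts -1

accuracy = accuracy_score(y_true, y_pred)                # fraction of matching rows
cm = confusion_matrix(y_true, y_pred, labels=[-1, 1])    # rows: true, columns: predicted
recall_pos = recall_score(y_true, y_pred, pos_label=1)   # share of true 1s recovered

print(accuracy)
print(cm)
print(recall_pos)
```

On the notebook's data, the corresponding call would be `accuracy_score(df_dev['Assessment'], predicted)`.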

## Evaluation of the performance on the test set