
# **Text Classification with Python and Scikit-Learn**

by Luis H PINTO

The 1st step is to import the libraries that provide the packages and functions used by the **Python** routines.

1. Cloning the **GitHub** repository with **git** makes its files available to the notebook.



In [1]:
# Remove any previous copy, then clone a fresh one.
# A failed shell command does not raise a Python exception, so try/except is not needed here.
!rm -fr './text-classification'
!git clone 'https://github.com/luishpinto/text-classification'

Cloning into 'text-classification'...
remote: Enumerating objects: 55, done.
remote: Counting objects: 100% (55/55), done.
remote: Compressing objects: 100% (51/51), done.
remote: Total 55 (delta 17), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (55/55), done.


2. **NumPy** and **Pandas** are the standard libraries for general numerical work and dataframe manipulation.

In [2]:
import numpy as np
import pandas as pd

3. **Scikit-Learn**, or simply **sklearn**, is the core of the **Machine Learning** process. This notebook uses several of its functions and classification models.

> The functions:

> * **CountVectorizer** -- convert a collection of text documents to a matrix of token counts;

> * **train_test_split** -- split arrays or matrices into random train and test subsets;

> * **Pipeline** -- sequentially apply a list of transforms and a final estimator.

> The classification models:

> * **LogisticRegression** -- logistic regression classifier;

> * **DecisionTreeClassifier** -- decision tree classifier;

> * **SGDClassifier** -- linear classifiers with stochastic gradient descent (SGD) training.
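
As a minimal sketch of how the functions listed above fit together, the block below splits a tiny dataset, then chains a **CountVectorizer** and a **LogisticRegression** in one **Pipeline**. The sentences, labels and the names `texts`, `labels` and `toy_pipe` are hypothetical and exist only for illustration; they are not part of the notebook's data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy data, invented for illustration only.
texts = ['good food', 'bad food', 'good service', 'bad service',
         'really good', 'really bad', 'good place', 'bad place']
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Hold out 25% of the samples; stratify keeps both classes in each subset.
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels)

# The pipeline applies the vectorizer first, then the final estimator.
toy_pipe = Pipeline(steps=[('vectorizer', CountVectorizer()),
                           ('classifier', LogisticRegression())])
toy_pipe.fit(texts_train, y_train)

toy_predictions = toy_pipe.predict(texts_test)
print(toy_predictions)
```

Placing the vectorizer inside the pipeline means raw sentences can be passed straight to `fit` and `predict`, which is one possible design; the notebook itself vectorizes outside the pipeline.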





In [3]:
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier

from sklearn.tree import DecisionTreeClassifier

from sklearn.pipeline import Pipeline

The 2nd step is to organize the data.

1. Locating the data files in the cloned **GitHub** repository.

In [4]:
files = !ls -d -1 './text-classification/data-base/'*

2. Creating the **dataframe**.

In [5]:
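# Map each source name to its file path.
# NOTE: `ls` returns the paths in alphabetical order, so check that each
# index really points at the file belonging to that source.
flist = dict()
flist['yelp'] = files[0]
flist['imdb'] = files[1]
flist['amazon'] = files[2]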

In [6]:
dflist = []
for source,path in flist.items():
  df = pd.read_csv(path,names = ['sentence','label'],sep = '\t')
  df['source'] = source
  dflist.append(df)

df = pd.concat(dflist)

print(df.iloc[0])

sentence    So there is no way for me to plug it in here i...
label                                                       0
source                                                   yelp
Name: 0, dtype: object
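
The loading step above can be sketched without the repository files. This is an illustration under stated assumptions: the two in-memory "files", their sentences and the names `fake_files` and `demo` are hypothetical stand-ins for the tab-separated data files.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-ins for two tab-separated data files.
fake_files = {
    'source_a': io.StringIO('Great phone.\t1\nBattery died fast.\t0\n'),
    'source_b': io.StringIO('Loved the pasta.\t1\nCold food.\t0\n'),
}

frames = []
for source, handle in fake_files.items():
    # The files have no header row, so the column names are supplied here.
    part = pd.read_csv(handle, names=['sentence', 'label'], sep='\t')
    part['source'] = source
    frames.append(part)

# ignore_index=True renumbers the rows 0..n-1 across all sources.
demo = pd.concat(frames, ignore_index=True)
print(demo)
```

The notebook calls `pd.concat` without `ignore_index`, so each source keeps its own row numbers starting at 0; that is harmless here because rows are later selected by the `source` column rather than by index.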


4. The input for any **Regression** or **Classifier** model must be a numeric matrix, so the sentences must first be transformed into a matrix of token counts. This operation is performed by **CountVectorizer**.

In [7]:
sentences = ['John likes ice cream','John hates chocolate.']

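# min_df = 1 (the default) keeps every token; recent scikit-learn releases may reject an integer min_df of 0
vectorizer = CountVectorizer(min_df = 1,lowercase = False)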
vectorizer.fit(sentences)

print(vectorizer.vocabulary_)

{'John': 0, 'likes': 5, 'ice': 4, 'cream': 2, 'hates': 3, 'chocolate': 1}


In [8]:
print(vectorizer.transform(sentences).toarray())

[[1 0 1 0 1 1]
 [1 1 0 1 0 0]]
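
To make the column layout of that matrix explicit, the sketch below lists the vocabulary in column order and then transforms a sentence containing words that were not seen during fitting. The extra sentence and the names `demo_vectorizer`, `columns` and `new_counts` are hypothetical additions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ['John likes ice cream', 'John hates chocolate.']

demo_vectorizer = CountVectorizer(lowercase=False)
demo_vectorizer.fit(sentences)

# Order the tokens by their column index in the count matrix.
columns = sorted(demo_vectorizer.vocabulary_, key=demo_vectorizer.vocabulary_.get)
print(columns)

# 'and' and 'tea' were not seen during fit, so they are ignored by transform.
new_counts = demo_vectorizer.transform(['John likes chocolate and tea']).toarray()
print(new_counts)
```

Each row of the matrix has one column per vocabulary token, which is why the vectorizer fitted on the training sentences must also be the one used to transform any later sentence.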


5. Testing the accuracy of **LogisticRegression**, **DecisionTreeClassifier** and **SGDClassifier**.

In [9]:
classifiers = [LogisticRegression(),
               DecisionTreeClassifier(),
               SGDClassifier()]

for clf in classifiers:

  print('\n\nClassifier: {}\n'.format(clf))

  for i in df['source'].unique():
    dfsource = df[df['source'] == i]
    S = dfsource['sentence'].values
    y = dfsource['label'].values

    Strain,Stest,ytrain,ytest = train_test_split(S,y,test_size = 0.25,random_state = 1000)

    vectorizer = CountVectorizer()
    vectorizer.fit(Strain)

    Xtrain = vectorizer.transform(Strain)
    Xtest = vectorizer.transform(Stest)

    pipe = Pipeline(steps = [('classifier',clf)])
    pipe.fit(Xtrain,ytrain)
    score = pipe.score(Xtest,ytest)

    print('Accuracy for {} dataset: {:.3f}'.format(i,score))



Classifier: LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

Accuracy for yelp dataset: 0.796
Accuracy for imdb dataset: 0.749
Accuracy for amazon dataset: 0.796


Classifier: DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

Accuracy for yelp dataset: 0.792
Accuracy for imdb dataset: 0.626
Accuracy for amazon dataset: 0.764


Classifier: S
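
The accuracy reported by `score` in the loop above is the fraction of test samples whose predicted label matches the true label. The sketch below checks this by hand on hypothetical toy data; the sentences, labels and variable names are assumptions made for illustration, not the notebook's datasets.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data, invented for illustration only.
train_texts = ['good food', 'bad food', 'good service', 'bad service']
train_labels = np.array([1, 0, 1, 0])
test_texts = ['good place', 'bad place']
test_labels = np.array([1, 0])

# Fit the vocabulary on the training sentences only, as in the loop above.
vec = CountVectorizer().fit(train_texts)
model = LogisticRegression().fit(vec.transform(train_texts), train_labels)

# Accuracy computed by hand: share of predictions equal to the true label.
predicted = model.predict(vec.transform(test_texts))
manual_accuracy = np.mean(predicted == test_labels)

# Accuracy as returned by the classifier's own score method.
builtin_accuracy = model.score(vec.transform(test_texts), test_labels)
print(manual_accuracy, builtin_accuracy)
```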

In [10]:
mood = ['Bad mood','Good mood']

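# Enter the sentence to classify
sentence = "Crust is no good"

# `pipe` and `vectorizer` are the last ones fitted in the loop above,
# so the last classifier, trained on the last source, makes this prediction.
print(mood[pipe.predict(vectorizer.transform([sentence]))[0]])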

Bad mood
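
The prediction step can be wrapped in a small helper so that the fitted vectorizer and model are passed in explicitly instead of being picked up from the last loop iteration. This is only a sketch: `predict_mood`, `MOOD` and the toy training data are hypothetical names and values introduced for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Index 0 corresponds to label 0 (negative), index 1 to label 1 (positive).
MOOD = ['Bad mood', 'Good mood']

def predict_mood(sentence, fitted_vectorizer, fitted_model):
    """Vectorize one sentence and map the predicted label to a mood string."""
    features = fitted_vectorizer.transform([sentence])
    return MOOD[int(fitted_model.predict(features)[0])]

# Hypothetical toy training data, invented for illustration only.
toy_texts = ['good food', 'bad food', 'good service', 'bad service']
toy_labels = [1, 0, 1, 0]

toy_vectorizer = CountVectorizer().fit(toy_texts)
toy_model = LogisticRegression().fit(toy_vectorizer.transform(toy_texts), toy_labels)

result = predict_mood('good pizza', toy_vectorizer, toy_model)
print(result)
```

Passing the vectorizer and model as arguments makes it clear which trained pair produces the answer, which matters when several classifiers and datasets have been fitted in the same session.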
