scikit-learn is a Python library built on NumPy, SciPy, and Matplotlib. It provides efficient tools for machine learning and statistical modeling, including regression, classification, clustering, and dimensionality reduction, and it is widely regarded as one of the most robust general-purpose machine learning libraries.

In [3]:
import re

import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split

# Fetch the NLTK stop-word list used in the tokenization step
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


### Load the Data

In [5]:
df = pd.read_csv('https://raw.githubusercontent.com/thunderstroke325/60-Days-of-Data-Science-and-ML/main/datasets/imdb_data.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


### Transforming Documents into Feature Vectors

In [6]:
count = CountVectorizer()
docs = np.array([
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, the weather is sweet, and one and one is two',
])
# Learn the vocabulary and build the document-term count matrix
bag = count.fit_transform(docs)
print(bag.toarray())

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]


In [7]:
print(count.vocabulary_)

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}
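
To make the mapping between the vocabulary and the matrix columns concrete, here is a minimal standard-library sketch that rebuilds the same counts by hand. It assumes the default `CountVectorizer` behavior of lowercasing the text and keeping tokens of two or more word characters; the helper name `tokenize` is made up for this illustration.

```python
import re
from collections import Counter

docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, the weather is sweet, and one and one is two',
]

def tokenize(text):
    # Lowercase and keep runs of two or more word characters
    # (assumed to mirror CountVectorizer's default token pattern)
    return re.findall(r'\b\w\w+\b', text.lower())

# Vocabulary: every distinct token, indexed in alphabetical order
all_terms = sorted({term for doc in docs for term in tokenize(doc)})
vocabulary = {term: idx for idx, term in enumerate(all_terms)}

# One row per document, one column per vocabulary term
bag = []
for doc in docs:
    term_counts = Counter(tokenize(doc))
    bag.append([term_counts[term] for term in all_terms])

for row in bag:
    print(row)
```

Each column index matches the value stored for that word in `vocabulary`, so column 6 holds the counts of "the" and column 1 the counts of "is".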


### Term Frequency-Inverse Document Frequency

In [9]:
np.set_printoptions(precision=2)
# TF-IDF weighting with smoothed IDF and L2-normalized rows
tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)
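
The transformer is only configured here, so the following standard-library sketch shows what these settings compute on the three example sentences. It assumes the smoothed formula that scikit-learn documents for `smooth_idf=True`, namely idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The names `doc_freq`, `l2_normalize`, and `tfidf_rows` are illustrative, not part of any library.

```python
import math

# Count matrix for the three example sentences (from the bag-of-words output)
counts = [
    [0, 1, 0, 1, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 1, 0, 1],
    [2, 3, 2, 1, 1, 1, 2, 1, 1],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: how many documents contain each term
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]

def l2_normalize(row):
    # Scale the row to unit Euclidean length (all-zero rows are left alone)
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm else row

# Weight raw term counts by IDF, then normalize each document vector
tfidf_rows = [
    l2_normalize([tf * weight for tf, weight in zip(row, idf)])
    for row in counts
]

for row in tfidf_rows:
    print([round(v, 2) for v in row])
```

A word such as "is", which occurs in every document, gets the minimum IDF of 1, so its weight in the third sentence ends up lower than that of "and" even though its raw count is higher.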

### Data Preparation

In [10]:
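def preprocessor(text):
    # Strip HTML tags
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :), ;-( or =D before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, collapse non-word characters into spaces, then re-append
    # the emoticons with their "nose" character removed
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    return text

# Try the function on the tail of the first review
preprocessor(df.loc[0, 'review'][-50:])
# Clean every review in place
df['review'] = df['review'].apply(preprocessor)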
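
As a quick check of what the cleaning step does, the sketch below repeats the same function and runs it on a short made-up string (the sample text is hypothetical, not taken from the dataset).

```python
import re

def preprocessor(text):
    # Same logic as the cell above: drop HTML tags, save emoticons,
    # lowercase, and replace runs of non-word characters with a space
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    return text

# Hypothetical input containing a stray closing tag and three emoticons
sample = '</a>This :) is :( a test :-)!'
cleaned = preprocessor(sample)
print(cleaned)
```

The tag and punctuation disappear, while the emoticons are moved to the end of the string so they can still act as sentiment features.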

### Document Tokenization

In [11]:
porter = PorterStemmer()

def tokenizer(text):
    # Split on whitespace only
    return text.split()

def tokenizer_porter(text):
    # Split on whitespace, then reduce each word to its Porter stem
    return [porter.stem(word) for word in text.split()]

print(tokenizer('Its a wonderful day'))
print(tokenizer_porter('Beautiful day'))
# Keep the last five stemmed tokens that are not English stop words
stop = stopwords.words('english')
[w for w in tokenizer_porter('Snow feels magical during New year')[-5:] if w not in stop]

['Its', 'a', 'wonderful', 'day']
['beauti', 'day']


['feel', 'magic', 'dure', 'new', 'year']

### Document Classification Using Logistic Regression

In [12]:
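# Reviews are already cleaned and lowercased, so only tokenize (with stemming) and weight
tfidf = TfidfVectorizer(
    strip_accents=None,
    lowercase=False,
    preprocessor=None,
    tokenizer=tokenizer_porter,
    use_idf=True,
    norm='l2',
    smooth_idf=True,
)
y = df.sentiment.values
# Note: the vectorizer is fitted on all 50,000 reviews, before the train/test split
X = tfidf.fit_transform(df.review)
# With shuffle=False the rows keep their original order and random_state has no effect
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=1, test_size=0.5, shuffle=False)
# 5-fold cross-validation selects the regularization strength C
clf = LogisticRegressionCV(cv=5, scoring='accuracy', random_state=0,
                           n_jobs=-1, verbose=3, max_iter=300).fit(X_train, y_train)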

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  3.0min finished


### Model Evaluation

In [13]:
# Mean accuracy on the held-out half of the reviews
clf.score(X_test, y_test)

0.89472
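
For a classifier, `score` reports mean accuracy: the fraction of test samples whose predicted label equals the true label. With 25,000 test reviews, 0.89472 corresponds to 22,368 correct predictions. The sketch below spells out that calculation on a handful of hypothetical labels; the `accuracy` helper and the label lists are illustrative and not taken from the IMDb data.

```python
def accuracy(y_true, y_pred):
    # Fraction of positions where the predicted label equals the true label
    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical labels, for illustration only
y_true = ['positive', 'negative', 'positive', 'negative', 'positive']
y_pred = ['positive', 'negative', 'negative', 'negative', 'positive']

print(accuracy(y_true, y_pred))
```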