### scikit-learn is a Python package designed for machine learning applications.
### Together with NumPy and pandas, it’s another core component in the Python data science ecosystem. 
### scikit-learn provides efficient, easy-to-use tools that can be applied to common machine learning problems, including exploratory and predictive data analysis.

In [9]:
# Load the sample dataset (tab-separated, no header row) into a pandas DataFrame
import pandas as pd
df = pd.read_csv("./sentiment labelled sentences/amazon_cells_labelled.txt", names=["review","sentiment"], sep='\t')
print(df)

                                                review  sentiment
0    So there is no way for me to plug it in here i...          0
1                          Good case, Excellent value.          1
2                               Great for the jawbone.          1
3    Tied to charger for conversations lasting more...          0
4                                    The mic is great.          1
..                                                 ...        ...
995  The screen does get smudged easily because it ...          0
996  What a piece of junk.. I lose more calls on th...          0
997                       Item Does Not Match Picture.          0
998  The only thing that disappoint me is the infra...          0
999  You can not answer calls with the unit, never ...          0

[1000 rows x 2 columns]
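
To see what `names` and `sep` do in the loading step above, here is a minimal sketch that parses two tab-separated lines from memory instead of from the file. The second row and the names `sample` and `toy_df` are hypothetical and used only for illustration.

```python
import io
import pandas as pd

# Two tab-separated lines with no header row; the second row is made up
sample = "Good case, Excellent value.\t1\nStopped working after a week.\t0\n"

# names= supplies the column labels, sep="\t" splits each line at the tab
toy_df = pd.read_csv(io.StringIO(sample), names=["review", "sentiment"], sep="\t")
print(toy_df.shape)                  # (2, 2)
print(toy_df["sentiment"].tolist())  # [1, 0]
```

Because the separator is a tab, the comma inside the first review stays part of the text.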


In [10]:
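# Split the data into a training set and a test set:
# one to train the predictive model, one to test its accuracy.
# test_size=0.2 holds out 20% of the rows; random_state makes the shuffle reproducible.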
from sklearn.model_selection import train_test_split
reviews = df["review"].values
sentiments = df["sentiment"].values
reviewsTrain, reviewsTest, sentimentsTrain, sentimentsTest = train_test_split(reviews, sentiments, test_size=0.2, random_state=500)
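
A small sketch of what the split does, assuming ten hypothetical placeholder reviews (`texts`, `labels`) in place of the real arrays. It shows the 80/20 sizes and that a fixed `random_state` reproduces the same split.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the review and sentiment arrays
texts = [f"review {i}" for i in range(10)]
labels = [i % 2 for i in range(10)]

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle
texts_train, texts_test, labels_train, labels_test = train_test_split(
    texts, labels, test_size=0.2, random_state=500
)
print(len(texts_train), len(texts_test))  # 8 2

# Calling again with the same random_state yields the same test rows
again = train_test_split(texts, labels, test_size=0.2, random_state=500)
print(again[1] == texts_test)  # True
```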

In [11]:
# Transform the text into numerical feature vectors.
# A bag-of-words (BoW) model represents a text as the collection of its words, ignoring word order, so that numerical features can be derived from it.
# The most common numerical feature derived from a BoW model is word frequency.
# Let's use CountVectorizer() from sklearn.feature_extraction.text to count word frequencies.

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
# Learn the vocabulary from the training reviews only, so the test set stays unseen
vectorizer.fit(reviewsTrain)
xTrain = vectorizer.transform(reviewsTrain)
xTest = vectorizer.transform(reviewsTest)
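
To make the bag-of-words idea concrete, here is a minimal sketch on two hypothetical sentences (`docs`); the names `toy_vectorizer`, `counts`, and `vocab` are illustrative only. Each column corresponds to one vocabulary word and each cell holds how often that word occurs in the sentence.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical documents
docs = ["great phone great case", "bad phone"]

toy_vectorizer = CountVectorizer()
counts = toy_vectorizer.fit_transform(docs)  # sparse document-term matrix

# vocabulary_ maps each word to its column index; list the words in column order
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocab)             # ['bad', 'case', 'great', 'phone']
print(counts.toarray())
# [[0 1 2 1]
#  [1 0 0 1]]

# Words that were not seen during fit() are ignored by transform()
unseen = toy_vectorizer.transform(["great unknown"]).toarray()
print(unseen)            # [[0 0 1 0]]
```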

In [12]:
# Use the LogisticRegression() classifier to predict the sentiment of a review.
# Logistic regression is a basic yet popular algorithm for solving classification problems.
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression() # create the LogisticRegression() classifier
classifier.fit(xTrain, sentimentsTrain) # fit() trains the model on the given training data
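
One way to see what the fitted model has learned is to look at the weight it assigns to each word. The sketch below assumes a hypothetical four-review corpus (`toy_reviews`, `toy_labels`), not the Amazon data. Words that occur only in positive reviews end up with positive weights, and words that occur only in negative reviews end up with negative ones.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus: 1 = positive, 0 = negative
toy_reviews = ["great phone", "great value", "junk phone", "junk battery"]
toy_labels = [1, 1, 0, 0]

toy_vectorizer = CountVectorizer()
toy_x = toy_vectorizer.fit_transform(toy_reviews)
toy_classifier = LogisticRegression()
toy_classifier.fit(toy_x, toy_labels)

# For a binary problem coef_ has a single row with one weight per vocabulary word
weights = {
    word: toy_classifier.coef_[0][index]
    for word, index in toy_vectorizer.vocabulary_.items()
}
# Print the words from most negative to most positive weight
for word, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{word:>8} {weight:+.3f}")
```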

In [13]:
# score() returns the mean accuracy: the fraction of test reviews classified correctly
accuracy = classifier.score(xTest, sentimentsTest)
print(f"Accuracy: {accuracy}")

Accuracy: 0.81
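
Accuracy is simply the share of predictions that match the true labels. A minimal sketch with hypothetical label arrays (`y_true`, `y_pred`) shows the manual calculation next to `accuracy_score`, plus a confusion matrix that breaks the errors down by class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1])

# Fraction of positions where prediction and truth agree: 3 of 5
manual = (y_true == y_pred).mean()
print(manual, accuracy_score(y_true, y_pred))  # 0.6 0.6

# Rows are true classes (0, 1); columns are predicted classes (0, 1)
matrix = confusion_matrix(y_true, y_pred)
print(matrix)
# [[1 1]
#  [1 2]]
```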


In [14]:
# Making Predictions on New Data
new_reviews = ["Old version of python useless", "Very good effort, but not five stars", "Clear and concise"]
# Reuse the fitted vectorizer so the new reviews map onto the same vocabulary
xNew = vectorizer.transform(new_reviews)
print(classifier.predict(xNew))

[0 0 1]
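
Besides hard 0/1 labels, `LogisticRegression` can also report class probabilities through `predict_proba()`, which helps judge how confident a prediction is. The sketch below again uses a hypothetical toy corpus rather than the Amazon data, so its numbers say nothing about the model trained above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus: 1 = positive, 0 = negative
toy_reviews = ["great phone", "excellent value", "works great",
               "piece of junk", "bad battery", "does not work"]
toy_labels = [1, 1, 1, 0, 0, 0]

toy_vectorizer = CountVectorizer()
toy_model = LogisticRegression()
toy_model.fit(toy_vectorizer.fit_transform(toy_reviews), toy_labels)

toy_new = toy_vectorizer.transform(["great value", "junk battery"])
preds = toy_model.predict(toy_new)
probs = toy_model.predict_proba(toy_new)  # one row per review, one column per class

# Columns follow the order of toy_model.classes_, here [0, 1]
for label, row in zip(preds, probs):
    print(label, row.round(3))
```

Each row of `probs` sums to 1, and the predicted label is the class with the larger probability.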
