# AV - Sentiment Analysis - NLP Project

Problem Statement
Sentiment analysis remains one of the key problems that has seen extensive application of natural language processing. This time around, given the tweets from customers about various tech firms who manufacture and sell mobiles, computers, laptops, etc, the task is to identify if the tweets have a negative sentiment towards such companies or products.

 

Evaluation Metric
The metric used for evaluating the performance of classification model would be weighted F1-Score.

## Imports
 **Import the usual suspects. :) **

In [1]:
import numpy as np
import pandas as pd

## The Data

**Read the train.csv and test.csv file **

In [2]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

** Check the head, info , and describe methods on yelp.**

In [None]:
train.head()

In [None]:
train.info()

In [None]:
train.describe()

# EDA

Let's explore the data

## Imports

**Import the data visualization libraries if you haven't done so already.**

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
%matplotlib inline

**Use FacetGrid from the seaborn library to create a grid of 5 histograms of text length based off of the star ratings. Reference the seaborn documentation for hints on this**

In [None]:
train['tweet_len'] = train['tweet'].apply(len)

In [None]:
train.head()

In [None]:
g = sns.FacetGrid(train,col='label')
g.map(plt.hist,'tweet_len')

**Create a boxplot of text length for each star category.**

In [None]:
sns.boxplot(x='label',y='tweet_len',data=train,palette='rainbow')

**Create a countplot of the number of occurrences for each type of star rating.**

In [None]:
sns.countplot(x='label',data=train,palette='rainbow')

## NLP Classification Task

** Create two objects X and y. X will be the 'text' column of yelp_class and y will be the 'stars' column of yelp_class. (Your features and target/labels)**

In [3]:
X = train['tweet']
y = train['label']

## Text Preprocessing

In [4]:
import re

In [5]:
from nltk import word_tokenize
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

def remove_stopwords(string):
    word_list = [word.lower() for word in string.split()]
    stopwords_list = list(stopwords.words("english"))
    for word in word_list:
        if word in stopwords_list:
            word_list.remove(word)
    return ' '.join(word_list)

In [6]:
train['tweet'] = train['tweet'].map(lambda x: re.sub('\\n',' ',str(x)))
train['tweet'] = train['tweet'].map(lambda x: re.sub(r'\W',' ',str(x)))
train['tweet'] = train['tweet'].map(lambda x: re.sub(r'https\s+|www.\s+',r'', str(x)))
train['tweet'] = train['tweet'].map(lambda x: re.sub(r'http\s+|www.\s+',r'', str(x)))
train['tweet'] = train['tweet'].map(lambda x: re.sub(r'\s+[a-zA-Z]\s+',' ',str(x)))
train['tweet'] = train['tweet'].map(lambda x: re.sub(r'\^[a-zA-Z]\s+',' ',str(x)))
train['tweet'] = train['tweet'].map(lambda x: re.sub(r'\s+',' ',str(x)))
train['tweet'] = train['tweet'].str.lower()

train['tweet'] = train['tweet'].map(lambda x: re.sub(r"\’", "\'", str(x)))
train['tweet'] = train['tweet'].map(lambda x: re.sub(r"won\'t", "will not", str(x)))
train['tweet'] = train['tweet'].map(lambda x: re.sub(r"can\'t", "can not", str(x)))
train['tweet'] = train['tweet'].map(lambda x: re.sub(r"don\'t", "do not", str(x)))
train['tweet'] = train['tweet'].map(lambda x: re.sub(r"dont", "do not", str(x)))
train['tweet'] = train['tweet'].map(lambda x: re.sub(r"n\’t", " not", str(x)))
train['tweet'] = train['tweet'].map(lambda x: re.sub(r"n\'t", " not", str(x)))
train['tweet'] = train['tweet'].map(lambda x: re.sub(r"\'re", " are", str(x)))
train['tweet'] = train['tweet'].map(lambda x: re.sub(r"\'s", " is", str(x)))
train['tweet'] = train['tweet'].map(lambda x: re.sub(r"\’d", " would", str(x)))
train['tweet'] = train['tweet'].map(lambda x: re.sub(r"\d", " would", str(x)))
train['tweet'] = train['tweet'].map(lambda x: re.sub(r"\'ll", " will", str(x)))
train['tweet'] = train['tweet'].map(lambda x: re.sub(r"\'t", " not", str(x)))
train['tweet'] = train['tweet'].map(lambda x: re.sub(r"\'ve", " have", str(x)))
train['tweet'] = train['tweet'].map(lambda x: re.sub(r"\'m", " am", str(x)))
train['tweet'] = train['tweet'].map(lambda x: re.sub(r"\n", "", str(x)))
train['tweet'] = train['tweet'].map(lambda x: re.sub(r"\r", "", str(x)))
train['tweet'] = train['tweet'].map(lambda x: re.sub(r"[0-9]", "digit", str(x)))
train['tweet'] = train['tweet'].map(lambda x: re.sub(r"\'", "", str(x)))
train['tweet'] = train['tweet'].map(lambda x: re.sub(r"\"", "", str(x)))
train['tweet'] = train['tweet'].map(lambda x: re.sub(r'[?|!|\'|"|#]',r'', str(x)))
train['tweet'] = train['tweet'].map(lambda x: re.sub(r'[.|,|)|(|\|/]',r' ', str(x)))
train['tweet'] = train['tweet'].apply(lambda x: remove_stopwords(x))
train.head(10)

Unnamed: 0,id,label,tweet
0,1,0,fingerprint pregnancy test goo gl h wouldmfqv ...
1,2,0,finally transparant silicon case thanks my unc...
2,3,0,love would go talk makememories unplug relax i...
3,4,0,wired know george made way iphone cute daventr...
4,5,1,amazing service apple even talk me question un...
5,6,1,iphone software update fucked my phone big tim...
6,7,0,happy us instapic instadaily us sony xperia xp...
7,8,0,new type charger cable uk www ebay co uk itm w...
8,9,0,bout go shopping listening music iphone justme...
9,10,0,photo fun selfie pool water sony camera picoft...


In [7]:
test['tweet'] = test['tweet'].map(lambda x: re.sub('\\n',' ',str(x)))
test['tweet'] = test['tweet'].map(lambda x: re.sub(r'\W',' ',str(x)))
test['tweet'] = test['tweet'].map(lambda x: re.sub(r'\s+[a-zA-Z]\s+',' ',str(x)))
test['tweet'] = test['tweet'].map(lambda x: re.sub(r'\^[a-zA-Z]\s+',' ',str(x)))
test['tweet'] = test['tweet'].map(lambda x: re.sub(r'\s+',' ',str(x)))
test['tweet'] = test['tweet'].str.lower()

test['tweet'] = test['tweet'].map(lambda x: re.sub(r"\’", "\'", str(x)))
test['tweet'] = test['tweet'].map(lambda x: re.sub(r"won\'t", "will not", str(x)))
test['tweet'] = test['tweet'].map(lambda x: re.sub(r"can\'t", "can not", str(x)))
test['tweet'] = test['tweet'].map(lambda x: re.sub(r"don\'t", "do not", str(x)))
test['tweet'] = test['tweet'].map(lambda x: re.sub(r"dont", "do not", str(x)))
test['tweet'] = test['tweet'].map(lambda x: re.sub(r"n\’t", " not", str(x)))
test['tweet'] = test['tweet'].map(lambda x: re.sub(r"n\'t", " not", str(x)))
test['tweet'] = test['tweet'].map(lambda x: re.sub(r"\'re", " are", str(x)))
test['tweet'] = test['tweet'].map(lambda x: re.sub(r"\'s", " is", str(x)))
test['tweet'] = test['tweet'].map(lambda x: re.sub(r"\’d", " would", str(x)))
test['tweet'] = test['tweet'].map(lambda x: re.sub(r"\d", " would", str(x)))
test['tweet'] = test['tweet'].map(lambda x: re.sub(r"\'ll", " will", str(x)))
test['tweet'] = test['tweet'].map(lambda x: re.sub(r"\'t", " not", str(x)))
test['tweet'] = test['tweet'].map(lambda x: re.sub(r"\'ve", " have", str(x)))
test['tweet'] = test['tweet'].map(lambda x: re.sub(r"\'m", " am", str(x)))
test['tweet'] = test['tweet'].map(lambda x: re.sub(r"\n", "", str(x)))
test['tweet'] = test['tweet'].map(lambda x: re.sub(r"\r", "", str(x)))
test['tweet'] = test['tweet'].map(lambda x: re.sub(r"[0-9]", "digit", str(x)))
test['tweet'] = test['tweet'].map(lambda x: re.sub(r"\'", "", str(x)))
test['tweet'] = test['tweet'].map(lambda x: re.sub(r"\"", "", str(x)))
test['tweet'] = test['tweet'].map(lambda x: re.sub(r'[?|!|\'|"|#]',r'', str(x)))
test['tweet'] = test['tweet'].map(lambda x: re.sub(r'[.|,|)|(|\|/]',r' ', str(x)))
test['tweet'] = test['tweet'].apply(lambda x: remove_stopwords(x))

test.tail(10)

Unnamed: 0,id,tweet
1943,9864,men the top gummers mommyhood iphone life inst...
1944,9865,am thoroughly addicted the angrybirds game con...
1945,9866,girl brazil life galaxys would samsung saturda...
1946,9867,whoop raj bigbangtheory not alone secret siri ...
1947,9868,you camera everytime look you smile boredom ca...
1948,9869,samsunggalaxynote would explodes burns would y...
1949,9870,available hoodie check out http zetasupplies c...
1950,9871,goes crack right across screen you could actua...
1951,9872,codeofinterest said adobe big time may well in...
1952,9873,finally got thanx father samsung galaxy would ...


### Normalization

In [None]:
# Can try lemmatization  1) using nltk.stem wordnet 
# Can word2Vec

In [52]:
import nltk
# Lemmatize with POS Tag
from nltk.corpus import wordnet

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

In [62]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
def get_normalize(string):
    # 1. Init Lemmatizer
    lemmatizer = WordNetLemmatizer()
    
    # 2. Lemmatize Single Word with the appropriate POS tag
    #word = 'feet'
    #print(lemmatizer.lemmatize(word, get_wordnet_pos(word)))

    # 3. Lemmatize a Sentence with the appropriate POS tag
    #sentence = "The striped bats are hanging on their feet for best"
    #print('Example:')
    #print([lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in nltk.word_tokenize(sentence)])
    #> ['The', 'strip', 'bat', 'be', 'hang', 'on', 'their', 'foot', 'for', 'best']
    new_list = [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in nltk.word_tokenize(string)]
    return ' '.join(new_list)

In [63]:
train['tweet'].head(5)

0    fingerprint pregnancy test goo gl h wouldmfqv ...
1    finally transparant silicon case thanks my unc...
2    love would go talk makememories unplug relax i...
3    wired know george made way iphone cute daventr...
4    amazing service apple even talk me question un...
Name: tweet, dtype: object

In [64]:
train['tweet'][4]

'amazing service apple even talk me question unless pay would would would would their stupid support'

In [65]:
get_normalize(train['tweet'][4])

'amaze service apple even talk me question unless pay would would would would their stupid support'

In [66]:
train['tweet1'] = train['tweet'].apply(lambda x: get_normalize(x))

In [67]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer, TfidfTransformer

In [68]:
vec = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode', analyzer='word',
                      token_pattern=r'\w{1,}', ngram_range=(1, 4), use_idf=1,smooth_idf=1,
                      sublinear_tf=1, stop_words = 'english')

train_tfidf = vec.fit_transform(train['tweet1'])

# TF-IDF
#tfidf_transformer = TfidfTransformer()
#X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
print(train_tfidf.shape)

(7920, 11084)


In [69]:
test_tfidf = vec.transform(test['tweet'])
print(test_tfidf.shape)

(1953, 11084)


In [None]:
vec = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode', analyzer='word',
                      token_pattern=r'\w{1,}', ngram_range=(1, 4), use_idf=1,smooth_idf=1,
                      sublinear_tf=1, stop_words = 'english')

X_train_tfidf = vec.fit_transform(train['tweet'])

# TF-IDF
#tfidf_transformer = TfidfTransformer()
#X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
print(X_train_tfidf.shape)

In [None]:
X_test_tfidf = vec.transform(test['tweet'])

print(X_test_tfidf.shape)

# Modeling

## Baseline - Logistic Regression

In [70]:
from sklearn.linear_model import LogisticRegression

In [80]:
lr_classifier = LogisticRegression(C=2.5,random_state=101)

In [81]:
lr_classifier.fit(train_tfidf, train['label'])



LogisticRegression(C=2.5, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=101, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [82]:
pred = lr_classifier.predict(test_tfidf)

In [83]:
sub = pd.read_csv('sample_submission.csv')

In [84]:
sub.head()

Unnamed: 0,id,label
0,7921,0
1,7922,0
2,7923,0
3,7924,0
4,7925,0


In [85]:
sub['label'] = pred

In [86]:
sub.head()

Unnamed: 0,id,label
0,7921,1
1,7922,0
2,7923,1
3,7924,1
4,7925,1


In [87]:
sub.to_csv('Submission009.csv',index=False)

In [None]:
from sklearn import metrics
from sklearn.metrics import accuracy_score, f1_score

In [None]:
for c in [10, 50, 100, 1000, 10000]:
    lr = LogisticRegression(C=c, random_state=101).fit(X_train_tfidf, y_train)
    #lr = LogisticRegression(C=c, random_state=2019).fit(X_train_tfidf, y_train)
    print ("f1 score for C=%s: %s" % (c, f1_score(y_val, lr.predict(X_val_tfidf))))

#Accuracy for C=50: 0.9627717816361645
#f1 score for C=1000: 0.999298245614035

In [None]:
log_reg = LogisticRegression(C=2.5,random_state=2019).fit(X_train_tfidf, target)

##### solver : str, {'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'},              default: 'liblinear'.

    Algorithm to use in the optimization problem.

    - For small datasets, 'liblinear' is a good choice, whereas 'sag' and
      'saga' are faster for large ones.
    - For multiclass problems, only 'newton-cg', 'sag', 'saga' and 'lbfgs'
      handle multinomial loss; 'liblinear' is limited to one-versus-rest
      schemes.
    - 'newton-cg', 'lbfgs' and 'sag' only handle L2 penalty, whereas
      'liblinear' and 'saga' handle L1 penalty. 
      
      #   penalty='l2',
    dual=True,
    tol=0.0001,
    fit_intercept=True,
    intercept_scaling=1,
    class_weight=None, #dict or 'balanced', default: None
    solver='warn', #solver : str, {'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'}, def= 'warn'
    max_iter=100,
    multi_class='warn',
    verbose=0,
    warm_start=False,
    n_jobs=None)

In [None]:
pred3 = log_reg.predict(X_test_tfidf)

In [None]:
sub3 = pd.read_csv('sample_submission.csv')

In [None]:
sub3['label'] = pred3

In [None]:
sub3.to_csv('Submission003.csv',index=False)

In [88]:
from sklearn.naive_bayes  import BernoulliNB

In [89]:
cls = BernoulliNB().fit(train_tfidf,train['label'])

In [90]:
pred5 = cls.predict(test_tfidf)

In [None]:
sub3 = pd.read_csv('sample_submission.csv')
sub3['label'] = pred5
sub3.to_csv('Submission003.csv',index=False)

In [None]:
from sklearn.svm  import SVC

In [None]:
cls = SVC(C=1,degree=1,gamma=1,kernel='sigmoid').fit(X_train_tfidf,train['label'])

In [None]:
pred5 = cls.predict(X_test_tfidf)

In [None]:
sub3 = pd.read_csv('sample_submission.csv')
sub3['label'] = pred5
sub3.to_csv('Submission003.csv',index=False)

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
# Create param grid
param_grid = {'C': [1], 'degree': [1], 'gamma': [1], 'kernel': ['sigmoid']}

In [None]:
grid = GridSearchCV(SVC(coef0=0.5),param_grid,refit=True,verbose=3)

In [None]:
grid.fit(X_train_tfidf,train['label'])

In [None]:
grid.best_params_

In [None]:
test['len'] = test['tweet'].apply(len)

In [None]:
test['len'].describe()

## Random Forest

In [None]:
from sklearn.ensemble  import RandomForestClassifier

In [None]:
rf_classifier = RandomForestClassifier(
 n_estimators= 300,
 min_samples_split= 10,
 min_samples_leaf= 4,
 max_features= 'sqrt',
 max_depth= 90,
 bootstrap= False)

In [None]:
rf_classifier.fit(X_train_tfidf, train['label'])

In [None]:
pred = rf_classifier.predict(X_test_tfidf)

In [None]:
sub = pd.read_csv('sample_submission.csv')

In [None]:
sub['label']=pred

In [None]:
sub.to_csv('Submission3.csv',index=False)

In [None]:
# parameter tuning

In [None]:
from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
print(random_grid)

In [None]:
# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf = RandomForestClassifier()
# Random search of parameters, using 3 fold cross validation, 
# search across 18 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 9, cv = 3, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
rf_random.fit(X_train_tfidf, train['label'])

In [None]:
rf_random.best_params_

## XGBoost Forest

In [None]:
from xgboost  import XGBClassifier

In [None]:
xgb_classifier = XGBClassifier()

In [None]:
xgb_classifier.fit(tweet_tfidf, train['label'])

In [None]:
pred = xgb_classifier.predict(test_tweet_tfidf)

In [None]:
sub4 = pd.read_csv('sample_submission.csv')

In [None]:
sub4['label'] = pred

In [None]:
sub4.to_csv('Submission4.csv',index=False)