### Codio Activity 18.3: Stemming and Lemmatization

In this activity, you will stem and lemmatize text in order to normalize it.  You will first review the lemmatizer and stemmer on a basic list of words and then turn to data in a DataFrame, writing a function that applies these operations to a column of text data.  The data is the WhatsApp status dataset from Kaggle, and you will focus on the `content` feature.

- [Problem 1](#-Problem-1)
- [Problem 2](#-Problem-2)
- [Problem 3](#-Problem-3)
- [Problem 4](#-Problem-4)
- [Problem 5](#-Problem-5)

In [4]:
import pandas as pd
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import word_tokenize
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')

[nltk_data] Downloading package wordnet to /Users/kellen/nltk_data...
[nltk_data] Downloading package omw-1.4 to /Users/kellen/nltk_data...
[nltk_data] Downloading package punkt to /Users/kellen/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

#### The Data

The text data again comes from [Kaggle](https://www.kaggle.com/datasets/sankha1998/emotion?select=Emotion%28sad%29.csv) and is used for classifying WhatsApp statuses by sentiment. We load only the "angry" sentiment below.


In [6]:
angry = pd.read_csv('codio_18_3_solution/data/Emotion(angry).csv')

In [7]:
angry.head()

Unnamed: 0,content,sentiment
0,"Sometimes I’m not angry, I’m hurt and there’s ...",angry
1,Not available for busy people☺,angry
2,I do not exist to impress the world. I exist t...,angry
3,Everything is getting expensive except some pe...,angry
4,My phone screen is brighter than my future 🙁,angry


### Problem 1

#### Stemming a list of words

Use `PorterStemmer` to stem the different variations on the word "compute" in the list `C` below.  Assign your results to the list `stemmed_words`.

In [8]:
C = ['computer', 'computing', 'computed', 'computes', 'computation', 'compute']

In [9]:
stemmer = PorterStemmer()

In [10]:
stemmed_words = [stemmer.stem(i) for i in C]

In [11]:
print(type(stemmed_words))
print(stemmed_words)

<class 'list'>
['comput', 'comput', 'comput', 'comput', 'comput', 'comput']


### Problem 2

#### Lemmatizing a list of words

Use `WordNetLemmatizer` to lemmatize the different variations on the word "compute" in the list `C` defined in Problem 1.  Assign your results to the list `lemmatized_words` below.

In [12]:
lemma = WordNetLemmatizer()

In [13]:
lemmatized_words = [lemma.lemmatize(w) for w in C]

In [14]:
print(type(lemmatized_words))
print(lemmatized_words)

<class 'list'>
['computer', 'computing', 'computed', 'computes', 'computation', 'compute']


### Problem 3

#### Which performed better?

Assuming we wanted all the words in `C` to be normalized to the same word, which worked better toward this goal: stemming or lemmatizing?  Assign your response as a string, `stem` or `lemmatize`, to `ans3` below.

In [15]:
ans3 = 'stem'
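One way to check this answer is to count how many distinct forms each approach produced. The sketch below reuses the two output lists printed in Problems 1 and 2; the helper `n_distinct` is a hypothetical name introduced only for this comparison.

```python
# Outputs copied from the Problem 1 and Problem 2 cells above.
stemmed_words = ['comput', 'comput', 'comput', 'comput', 'comput', 'comput']
lemmatized_words = ['computer', 'computing', 'computed',
                    'computes', 'computation', 'compute']

def n_distinct(words):
    """Return the number of distinct normalized forms in a list of words."""
    return len(set(words))

print(n_distinct(stemmed_words))     # 1: every word collapsed to 'comput'
print(n_distinct(lemmatized_words))  # 6: no two words were merged
```

A count of 1 means every variation was mapped to the same token, which is the stated goal here.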

### Problem 4

#### A function for stemming

Use `PorterStemmer` to complete the function `stemmer` below. This function should take in a string of text and return a string of stemmed text. Note that you will need to tokenize the text before stemming and should return a single string.  

Hint: use the `join` method

In [16]:
def stemmer(text):
    '''
    This function takes in a string of text and returns
    a string of stemmed text.
    
    Arguments
    ---------
    text: str
        string of text to be stemmed
        
    Returns
    -------
    str
       string of stemmed words from the text input
    '''
    return ''
    

In [21]:
def stemmer(text):
    lst = word_tokenize(text)
    stem = PorterStemmer()
    stem_lst = [stem.stem(i) for i in lst]
    return ' '.join(stem_lst)
    

In [19]:
text = 'The computer did not compute the answers correctly.'

In [22]:
stemmer(text)

'the comput did not comput the answer correctli .'

In [23]:
def stemmer(text):
    stem = PorterStemmer()
    return ' '.join([stem.stem(w) for w in word_tokenize(text)])
    
text = 'The computer did not compute the answers correctly.'
print(text)
print(stemmer(text))  # should return --> the comput did not comput the answer correctli .

The computer did not compute the answers correctly.
the comput did not comput the answer correctli .


### Problem 5

#### Using the stemmer on a DataFrame

Apply your function `stemmer` to the `content` feature of the DataFrame `angry`.  Assign the resulting Series to `stemmed_content` below.

Hint: use the `.apply` method

In [26]:
stemmed_content = angry['content'].apply(stemmer)
print(type(stemmed_content))
print(stemmed_content.head())

<class 'pandas.core.series.Series'>
0    sometim i ’ m not angri , i ’ m hurt and there...
1                           not avail for busi people☺
2    i do not exist to impress the world . i exist ...
3    everyth is get expens except some peopl , they...
4          my phone screen is brighter than my futur 🙁
Name: content, dtype: object
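Note that in the cells above the name `stemmer` first refers to a `PorterStemmer` instance (Problem 1) and is later rebound to a function (Problem 4), so re-running the Problem 1 list comprehension after Problem 4 would raise an error. The sketch below shows one way to avoid the clash, using the hypothetical names `porter` and `stem_text`. It also creates the stemmer once rather than on every call. As an assumption for portability, a simple regular expression stands in for `word_tokenize`, and the DataFrame is toy data rather than the Kaggle file.

```python
import re

import pandas as pd
from nltk.stem import PorterStemmer

porter = PorterStemmer()  # created once and reused for every row

def stem_text(text):
    """Split text into word and punctuation tokens, stem each, and rejoin."""
    # Assumption: this regex is a rough stand-in for nltk's word_tokenize.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return ' '.join(porter.stem(tok) for tok in tokens)

# Hypothetical toy data, not the Kaggle dataset.
toy = pd.DataFrame({'content': ['The computer did not compute the answers correctly.',
                                'Computing the answers']})
toy['stemmed'] = toy['content'].apply(stem_text)
print(toy['stemmed'].tolist())
```

On real status text the regex and `word_tokenize` can split contractions and emoji differently, so the two approaches are not guaranteed to give identical tokens.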


### Codio Activity 18.4: Bag of Words: Count Vectorization

In this activity you will use the scikit-learn vectorization tool `CountVectorizer` to create a bag of words representation of text in a DataFrame.  You will explore how different parameter settings affect the performance of a `LogisticRegression` estimator on a binary classification problem.

- [Problem 1](#-Problem-1)
- [Problem 2](#-Problem-2)
- [Problem 3](#-Problem-3)
- [Problem 4](#-Problem-4)
- [Problem 5](#-Problem-5)

In [27]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error

### The Data

Below, the data from Kaggle is loaded again.  This time, we combine the "sad" and "happy" sentiments, which will form the target of our classification models.  The data is also split into training and test sets and named appropriately below.

In [29]:
happy_df = pd.read_csv('codio_18_4_solution/data/Emotion(happy).csv')
sad_df = pd.read_csv('codio_18_4_solution/data/Emotion(sad).csv.zip', compression = 'zip')

In [30]:
full_df = pd.concat([happy_df, sad_df]).reset_index(drop = True)

In [31]:
X = full_df.drop('sentiment', axis = 1)
y = full_df['sentiment']

In [32]:
X_train, X_test, y_train, y_test = train_test_split(X['content'], y, random_state = 42)

In [33]:
X_train.head()

1287    ['You Hurt Me But I Still Love You.', 'True Lo...
1112    Sorry isn’t always enough. Sometimes you actua...
823     Sometimes two people have to fall apart to rea...
651     True love isn’t love at first sight but love a...
1101    i am scared of getting too close to anyone bec...
Name: content, dtype: object

### Problem 1

#### Using the `CountVectorizer`

To create a bag of words representation of your text data, create an instance of the `CountVectorizer` as `cvect` below.  Leave all the default settings, and assign the transformed version of the text to `dtm`.  Note that the vectorizer returns a `scipy.sparse` matrix, so to view the contents of the resulting document term matrix, the `.toarray()` method is used together with the `.get_feature_names_out()` method, which retrieves the fitted vocabulary.

Hint: Make sure to transform X_train

In [34]:
cvect = CountVectorizer()

In [35]:
dtm = cvect.fit_transform(X_train)

In [36]:
pd.DataFrame(dtm.toarray(), columns = cvect.get_feature_names_out()).head()

Unnamed: 0,0_0,100,123whatsappstatus,204,30,404,44,45,55,805,...,yes,yesterday,yet,you,young,your,yours,yourself,yous,yuh
0,0,0,0,0,0,0,0,0,0,0,...,2,0,1,112,0,13,0,2,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Problem 2

#### Limiting words with the `CountVectorizer`

Now, to remove stopwords from the text before vectorizing, create a new instance of the `CountVectorizer` and set `stop_words = 'english'` to remove English stop words, using the same list as in our earlier assignment.  Fit and transform the training data and transform the test data as `X_train_vect_2` and `X_test_vect_2` below.

Hint: Use `fit_transform` for the training data, and `transform` for the test data.

In [37]:
cvect2 = CountVectorizer(stop_words = 'english')
X_train_vect_2 = cvect2.fit_transform(X_train)
X_test_vect_2 = cvect2.transform(X_test)

X_train_vect_2

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 41589 stored elements and shape (1007, 1622)>

### Problem 3

#### Limiting words with stopwords and higher counts

Now, remove stopwords using `stop_words = 'english'` and limit the features to the top 300 words based on counts using the `max_features` argument.  Fit and transform your data appropriately as `X_train_vect_3` and `X_test_vect_3` below.

In [38]:
cvect3 = CountVectorizer(stop_words = 'english', max_features = 300)
X_train_vect_3 = cvect3.fit_transform(X_train)
X_test_vect_3 = cvect3.transform(X_test)

X_train_vect_3

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 33225 stored elements and shape (1007, 300)>

### Problem 4

#### Using the text with `LogisticRegression`

Create a `Pipeline` object named `vect_pipe_1` below that has steps named `cvect` and `lgr`, using both a default `CountVectorizer` transformer and `LogisticRegression` estimator. Fit this on the training data and evaluate it on the test set. 

In [39]:
vect_pipe_1 = Pipeline([('cvect', CountVectorizer()),
                       ('lgr', LogisticRegression())])
vect_pipe_1.fit(X_train, y_train)
test_acc = vect_pipe_1.score(X_test, y_test)

vect_pipe_1.named_steps

{'cvect': CountVectorizer(), 'lgr': LogisticRegression()}

### Problem 5

#### Pipeline and Grid Search

Finally, to abstract this work into a single step, you can use a `Pipeline` with named steps `cvect` and `lgr` that vectorizes and models the data.  Then, use the parameter grid below to perform a grid search for the best parameters for representing the text and building a classification model.

Hint: Use `vect_pipe_1` from Problem 4

In [40]:
params = {'cvect__max_features': [100, 500, 1000, 2000],
         'cvect__stop_words': ['english', None]}

In [41]:
grid = GridSearchCV(vect_pipe_1, param_grid=params)
grid.fit(X_train, y_train)
test_acc = grid.score(X_test, y_test)

grid.best_params_

{'cvect__max_features': 2000, 'cvect__stop_words': None}
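The keys in `params` follow the `step__parameter` pattern, so `cvect__max_features` sets `max_features` on the step named `cvect`. To compare all the combinations rather than only the best one, the search results can be put into a DataFrame. The sketch below does this on made-up toy data; the `toy_` names, the grid values, and `cv=3` are assumptions chosen so the example runs on twelve short documents.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data: six short documents per class.
toy_X = ['joy and laughter', 'so much joy', 'pure joy today',
         'laughter all day', 'joy with friends', 'friends and laughter',
         'tears again', 'tears and rain', 'only tears tonight',
         'rain all night', 'lonely tears', 'lonely in the rain']
toy_y = ['happy'] * 6 + ['sad'] * 6

toy_pipe = Pipeline([('cvect', CountVectorizer()),
                     ('lgr', LogisticRegression())])

# '<step name>__<parameter name>' routes each value to the right step.
toy_params = {'cvect__max_features': [3, 10],
              'cvect__stop_words': ['english', None]}

toy_grid = GridSearchCV(toy_pipe, param_grid=toy_params, cv=3)
toy_grid.fit(toy_X, toy_y)

results = pd.DataFrame(toy_grid.cv_results_)
cols = ['param_cvect__max_features', 'param_cvect__stop_words',
        'mean_test_score', 'rank_test_score']
print(results[cols].sort_values('rank_test_score'))
```

Each row is one parameter combination with its mean cross-validated accuracy, which shows how close the runner-up settings are to the one reported in `best_params_`.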