**Exploratory Data Analysis of Raw NLP Data**

**Package and Data Import**

In [1]:
### Packages to Import
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from nltk.tokenize import TweetTokenizer
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
nltk.download('wordnet')
import ssl

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\moore\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
### Import Data and Change Column Names

df = pd.read_csv("data/tweets.csv")
df.columns = ['text', 'device', 'emotion']

**Data Cleaning Prior to Train / Test Split**

In [3]:
### Delineating between Google and Apple

google_tweets = ['Google', 'Other Google product or service', 'Android App', 'Android']
apple_tweets = ['Apple', 'Other Apple product or service', 'Apple App', 'iPhone', 'iPad', 'iPad or iPhone App']

### Creating a new column for google vs. apple vs. unknown

df['device_type'] = np.where(df['device'].isin(google_tweets), 'Google', np.where(df['device'].isin(apple_tweets), 'Apple', 'Unknown'))

### Dropping rows whose emotion label is "I can't tell"

df = df[df['emotion'] != "I can't tell"]

### Dropping blank 'text' rows

df = df.dropna(subset=['text'])
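
The nested `np.where` call above assigns each row to 'Google', 'Apple', or 'Unknown'. As a minimal sketch of the same mapping, assuming a toy frame with hypothetical rows and shortened label lists (neither is taken from the dataset), `np.select` expresses the three-way choice without nesting and makes the fallback explicit:

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical rows; the real data comes from data/tweets.csv.
toy = pd.DataFrame({'device': ['iPhone', 'Google', None, 'iPad']})

# Shortened, illustrative label lists (assumptions for this sketch only).
google_labels = ['Google', 'Other Google product or service']
apple_labels = ['Apple', 'iPhone', 'iPad']

# np.select checks the conditions in order and falls back to `default`,
# so rows with a missing or unlisted device end up as 'Unknown'.
toy['device_type'] = np.select(
    [toy['device'].isin(google_labels), toy['device'].isin(apple_labels)],
    ['Google', 'Apple'],
    default='Unknown',
)
print(toy['device_type'].value_counts())
```

Running `value_counts()` on the new column is a quick way to spot a misspelled label, since every row that fails to match lands in 'Unknown'.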


**Performing Train / Test Split**

In [4]:
### Performing a train/test split

X = df.drop('emotion', axis=1)
y = df['emotion']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1337)

**Creating Functions that Clean and Tokenize the Text**

In [5]:
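### Creating a text-cleaning function that tokenizes a tweet, drops tokens that begin with @, #, or http, drops punctuation, lowercases the tokens, and removes stopwords

def clean_text(text):
    tknzr = TweetTokenizer()
    # Build the stopword set once per call instead of once per token
    stop_words = set(stopwords.words('english'))
    tokens = tknzr.tokenize(text)
    # Match whole prefixes; testing only the first character would drop every word starting with 'h'
    tokens = [word for word in tokens if not word.lower().startswith(('#', '@', 'http'))]
    tokens = [word for word in tokens if word not in string.punctuation]
    # Lowercase before the stopword check so capitalized stopwords are removed too
    tokens = [word.lower() for word in tokens]
    tokens = [word for word in tokens if word not in stop_words]
    return tokens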

### Creating a function that includes the 'clean_text' function and then lemmatizes the tokens in the 'text' column

def lemmatize_text(text):
    lemmatizer = WordNetLemmatizer()
    tokens = clean_text(text)
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return tokens
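
The prefix filter in `clean_text` is meant to drop mentions, hashtags, and links. The sketch below uses plain Python and a hand-written token list (hypothetical, not taken from the dataset) to compare a first-character test with a `startswith` test on whole prefixes:

```python
# Hypothetical tokens, roughly what a tweet tokenizer might emit.
tokens = ['@mention', '#sxsw', 'http://t.co/abc', 'happy', 'iPhone', 'hate']

# First-character test: any token starting with 'h' is dropped, not just links.
first_char = [t for t in tokens if t[0] not in ['#', '@', 'h']]

# Prefix test: only mentions, hashtags, and tokens starting with 'http' are dropped.
prefix = [t for t in tokens if not t.lower().startswith(('#', '@', 'http'))]

print(first_char)  # ['iPhone']
print(prefix)      # ['happy', 'iPhone', 'hate']
```

For sentiment data the difference matters, because words such as "happy" and "hate" carry much of the emotional signal.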

**Applying the Lemmatize Function to the Training Data**

In [6]:
### Applying the 'lemmatize_text' function to the 'text' column

# Work on an explicit copy so the assignment does not trigger SettingWithCopyWarning
X_train = X_train.copy()
X_train['text'] = X_train['text'].apply(lemmatize_text)
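
Assigning a column on a frame that pandas regards as derived from another frame can trigger a SettingWithCopyWarning. A minimal sketch of the explicit-copy pattern, assuming a toy frame with hypothetical rows and using `str.split` as a stand-in for `lemmatize_text`:

```python
import pandas as pd

# Toy frame with hypothetical rows standing in for the full dataset.
full = pd.DataFrame({
    'text': ['Loving the new iPad', 'Maps keeps crashing'],
    'device': ['iPad', 'Google'],
})

# Take an explicit copy of the subset so later column assignments
# clearly target a new object rather than a view of `full`.
subset = full[full['device'] == 'iPad'].copy()

# str.split stands in for the notebook's lemmatize_text function.
subset['text'] = subset['text'].apply(str.split)

print(subset['text'].iloc[0])
print(full['text'].iloc[0])
```

The original frame keeps its raw strings, while the copy holds the token lists.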


**Vectorizing the Text Column**