# NLP Techniques Lab

In this lab, we'll be practicing a set of advanced NLP techniques using tweets on airline satisfaction ([originally from Kaggle](https://www.kaggle.com/crowdflower/twitter-airline-sentiment/data)).

The first section asks you to run latent Dirichlet allocation (LDA) on the dataset to summarize the body of tweets. The second section focuses on using the same data to predict the sentiment of a given tweet.

Import the data as follows:

In [2]:
import pandas as pd

df = pd.read_csv('datasets/Tweets.csv')
print(df.shape)
df.head()

(14640, 15)


| | tweet_id | airline_sentiment | airline_sentiment_confidence | negativereason | negativereason_confidence | airline | airline_sentiment_gold | name | negativereason_gold | retweet_count | text | tweet_coord | tweet_created | tweet_location | user_timezone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 570306133677760513 | neutral | 1.0 | | | Virgin America | | cairdin | | 0 | @VirginAmerica What @dhepburn said. | | 2015-02-24 11:35:52 -0800 | | Eastern Time (US & Canada) |
| 1 | 570301130888122368 | positive | 0.3486 | | 0.0 | Virgin America | | jnardino | | 0 | @VirginAmerica plus you've added commercials t... | | 2015-02-24 11:15:59 -0800 | | Pacific Time (US & Canada) |
| 2 | 570301083672813571 | neutral | 0.6837 | | | Virgin America | | yvonnalynn | | 0 | @VirginAmerica I didn't today... Must mean I n... | | 2015-02-24 11:15:48 -0800 | Lets Play | Central Time (US & Canada) |
| 3 | 570301031407624196 | negative | 1.0 | Bad Flight | 0.7033 | Virgin America | | jnardino | | 0 | @VirginAmerica it's really aggressive to blast... | | 2015-02-24 11:15:36 -0800 | | Pacific Time (US & Canada) |
| 4 | 570300817074462722 | negative | 1.0 | Can't Tell | 1.0 | Virgin America | | jnardino | | 0 | @VirginAmerica and it's a really big bad thing... | | 2015-02-24 11:14:45 -0800 | | Pacific Time (US & Canada) |


Use this data to do the following:

#### 1. Use LDA to identify topics in the tweets

Pick a number of topics between 5 and 20 and use LDA to summarize the corpus of tweets. Print out the 25 most heavily weighted words in each topic. Do the topics appear cohesive to you? What predominant trends can you find?

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

# Build a bag-of-words count matrix, dropping common English stop words.
cv = CountVectorizer(stop_words='english')
X = cv.fit_transform(df['text'].values)

In [4]:
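from sklearn.decomposition import LatentDirichletAllocation

# get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0.
feature_names = cv.get_feature_names_out()
# n_components replaces the n_topics argument used by older scikit-learn releases.
lda = LatentDirichletAllocation(n_components=5)

lda.fit(X)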



LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7, learning_method=None,
             learning_offset=10.0, max_doc_update_iter=100, max_iter=10,
             mean_change_tol=0.001, n_jobs=1, n_topics=5, perp_tol=0.1,
             random_state=None, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)

In [5]:
results = pd.DataFrame(lda.components_,
                      columns=feature_names)

for topic in range(5):
    print('Topic', topic)
    word_list = results.T[topic].sort_values(ascending=False).index
    print(' '.join(word_list[0:25]), '\n')

Topic 0
usairways service united customer thanks great delay http worst days flight good amp experience airline line bad customers staff 10 really long boarding flights says 

Topic 1
americanair flight usairways united help hours hold delayed gate need plane bag time just late phone amp hour waiting change wait don ve trying got 

Topic 2
guys thank usairways want yes baggage dfw http fleet pay free dca seats finally really business class just wifi thx rep upgrade twitter seat reservations 

Topic 3
jetblue southwestair http thanks flight weather going minutes told like bags thank way fly new time love better times getting just know plane airlines don 

Topic 4
cancelled united flight flightled flights aa flighted jfk right ticket doesn thanks info refund dm response just sent book crew ord team flt look website 



#### Bonus LDA Question (Tackle if you have time / interest)

Calling the `.transform()` method of the fitted LDA model on the data you fed it returns a numpy array of shape `(n_rows, n_topics)`. The value in each column is the probability that the row in question belongs to that topic. For example, if we were looking at a row of data and an LDA model for three topics, we might see the following:

```python
lda.transform(row_of_data)
>> [[ 0.02, 0.97, 0.01 ]]
```

This would suggest that for that row of data, it is most likely to be in the second topic (compared to the first or third topic).
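
As a minimal sketch of reading that output, the snippet below takes the illustrative array from the example above (hard-coded here rather than produced by a fitted model) and picks the most likely topic with `argmax`. Note that the index it returns is zero-based, so `1` means the second topic.

```python
import numpy as np

# Illustrative output for one row and three topics, copied from the example
# above; it is hard-coded and not produced by a real model.
doc_topic = np.array([[0.02, 0.97, 0.01]])

# Each row is a distribution over topics, so its values sum to 1.
row_sums = doc_topic.sum(axis=1)

# argmax along axis 1 gives the zero-based index of the most likely topic.
best_topic = doc_topic.argmax(axis=1)
print(best_topic)  # [1], i.e. the second topic
```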

As a bonus challenge, try the following two questions:

1. For each topic, which tweet most exemplifies it (that is, which tweet is most likely to belong to that topic)?
2. Find a recent tweet directed at an airline that you have used. Can you use your current model to identify which topic it belongs to?
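
One possible way to approach both questions is sketched below. To stay self-contained it uses a tiny invented corpus in place of `df['text']`, three topics, and a fixed `random_state`; all of these are assumptions for illustration, and the topics found on six toy tweets are not meaningful. The idea is that `argmax(axis=0)` finds the highest-scoring row for each topic column, while an unseen tweet must go through the already fitted vectorizer before `lda.transform()`.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for df['text']; these tweets are invented for illustration.
tweets = [
    "flight delayed three hours waiting at the gate",
    "delayed again, stuck at gate for hours",
    "thanks for the great customer service today",
    "great crew and great service, thanks",
    "lost my bag and nobody answers the phone",
    "bag still missing, on hold on the phone for an hour",
]

cv = CountVectorizer(stop_words='english')
X = cv.fit_transform(tweets)

lda = LatentDirichletAllocation(n_components=3, random_state=0)
lda.fit(X)
doc_topic = lda.transform(X)  # shape (n_tweets, n_topics)

# Question 1: for each topic (column), find the row with the highest probability.
top_rows = doc_topic.argmax(axis=0)
for topic, row in enumerate(top_rows):
    print('Topic', topic, '->', tweets[row])

# Question 2: score an unseen tweet using the fitted vectorizer and model.
new_tweet = ["my flight was delayed and the gate changed twice"]
new_probs = lda.transform(cv.transform(new_tweet))
print('Most likely topic:', new_probs.argmax(axis=1)[0])
```

On the real data, the same two lines (`argmax(axis=0)` over `lda.transform(X)` and `lda.transform(cv.transform([...]))`) should apply unchanged to the `cv` and `lda` objects fitted earlier.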

#### 2. Use NLP to predict the sentiment of tweets

In this section, please use any of the NLP techniques that we have covered over the last two days to best predict whether a tweet has a negative sentiment or not. Transformation code for your target variable is below.

**Bonus Consideration**: Outside of the text itself, do other factors in the dataset have an effect? Do your results change if you include features like the airline or the timezone of the tweet?
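
A rough sketch of how a non-text column could be combined with the text is shown below, assuming scikit-learn's `ColumnTransformer` is an acceptable tool for this lab. The toy DataFrame is invented for illustration and only mirrors the `text`, `airline`, and `negative` column names; the choice of `RandomForestClassifier` and the `random_state` value are likewise assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the tweets DataFrame; rows are invented for illustration.
toy = pd.DataFrame({
    'text': [
        "flight delayed again, terrible service",
        "thanks for the smooth flight",
        "lost my bag, worst airline",
        "great crew, love flying with you",
        "on hold for two hours, awful",
        "quick boarding and friendly staff",
    ],
    'airline': ['United', 'Delta', 'United', 'Delta', 'United', 'Delta'],
    'negative': [1, 0, 1, 0, 1, 0],
})

features = ColumnTransformer([
    # A single column name (a string) passes 1-D text, as TfidfVectorizer expects.
    ('text', TfidfVectorizer(), 'text'),
    # A list of column names passes a 2-D frame, as OneHotEncoder expects.
    ('airline', OneHotEncoder(handle_unknown='ignore'), ['airline']),
])

model = make_pipeline(features, RandomForestClassifier(random_state=0))
model.fit(toy[['text', 'airline']], toy['negative'])
preds = model.predict(toy[['text', 'airline']])
print(preds)
```

Comparing the test-set score of this kind of pipeline against a text-only pipeline on the same split is one way to judge whether the extra column helps. If a column such as `user_timezone` turns out to have missing values, it may need to be filled before encoding.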

Don't forget to create a training and test set to compare your results. 

In [6]:
# 1 if the tweet is labelled negative, 0 otherwise (neutral or positive).
df['negative'] = (df['airline_sentiment'] == 'negative').astype(int)


In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['text'],
                                                   df['negative'], test_size=0.33)


In [8]:
import string

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Requires the NLTK stop-word list: run nltk.download('stopwords') once.
# Build these once rather than on every call to cleaner().
STEMMER = PorterStemmer()
STOP_WORDS = set(stopwords.words('english'))
# Translation table that deletes punctuation and digits in one pass.
STRIP_CHARS = str.maketrans('', '', string.punctuation + string.digits)

def cleaner(text):
    """Lowercase, strip punctuation and digits, drop stop words, and stem."""
    text = text.translate(STRIP_CHARS).lower()
    return ' '.join(STEMMER.stem(w) for w in text.split()
                    if w not in STOP_WORDS)

In [9]:
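from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier

# TF-IDF on the cleaned text, then SVD to reduce dimensionality, then a forest.
# Note: TruncatedSVD() defaults to n_components=2, so the classifier only
# sees two features per tweet.
pipeline = make_pipeline(
    TfidfVectorizer(preprocessor=cleaner),
    TruncatedSVD(),
    RandomForestClassifier())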

In [10]:
pipeline.fit(X_train, y_train)

Pipeline(steps=[('tfidfvectorizer', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2',
        preprocessor=<funct...imators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False))])

In [11]:
from sklearn.metrics import classification_report, confusion_matrix

# Training-set performance (first block of output below).
print(pipeline.score(X_train, y_train))
predictions = pipeline.predict(X_train)
print(confusion_matrix(y_train, predictions))
print(classification_report(y_train, predictions))

# Held-out test-set performance (second block of output below).
print(pipeline.score(X_test, y_test))
predictions = pipeline.predict(X_test)
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

0.918637846656
[[3283  429]
 [ 369 5727]]
             precision    recall  f1-score   support

          0       0.90      0.88      0.89      3712
          1       0.93      0.94      0.93      6096

avg / total       0.92      0.92      0.92      9808

0.681084437086
[[ 965  785]
 [ 756 2326]]
             precision    recall  f1-score   support

          0       0.56      0.55      0.56      1750
          1       0.75      0.75      0.75      3082

avg / total       0.68      0.68      0.68      4832

