# Basic Text Representation
The Twitter airline sentiment dataset (`tweets.csv`) was scraped in February 2015. Contributors first classified each tweet as positive, negative, or neutral, and then categorized the reasons behind negative tweets (such as "late flight" or "rude service"). The dataset is available [on Kaggle](https://www.kaggle.com/crowdflower/twitter-airline-sentiment).

We are looking for all tweets related to `bad catering service`. Use basic vectorization approaches and a similarity measure such as cosine similarity to rank tweets by their closeness to this topic. See the scikit-learn documentation for details.
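Before touching the real data, the overall idea can be sketched on a tiny example: turn every text into a count vector, turn the query into a vector over the same vocabulary, and sort by cosine similarity. The corpus and the `toy_*` names below are hypothetical and are not taken from `tweets.csv`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus (assumption: not part of the tweet dataset).
toy_tweets = [
    "the food was bad and the service was slow",
    "great crew and a smooth flight",
    "bad service at the gate",
]

# Learn the vocabulary and build one count vector per text.
toy_vectorizer = CountVectorizer()
toy_documents = toy_vectorizer.fit_transform(toy_tweets)

# The query must be transformed with the same fitted vectorizer.
toy_query = toy_vectorizer.transform(["bad service"])

# cosine_similarity returns a (1, n_texts) matrix; take its single row.
toy_scores = cosine_similarity(toy_query, toy_documents)[0]

# Indices of the texts from most to least similar.
toy_ranking = toy_scores.argsort()[::-1]
for i in toy_ranking:
    print(f"{toy_scores[i]:.3f}  {toy_tweets[i]}")
```

The shortest text that contains both query words ranks first, because cosine similarity normalizes by vector length; the text that shares no words with the query scores zero.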

# Importing Modules

In [4]:
import pandas as pd
# pd.set_option("display.max_colwidth", None)  # uncomment to show full tweet text
from sklearn.feature_extraction.text import CountVectorizer
import sklearn.metrics.pairwise

# Loading the Dataset

In [5]:
df = pd.read_csv('../../datasets/tweets.csv')
df = df.set_index('tweet_id')
print(df.shape)
df.head(3) 

(14640, 14)


| tweet_id | airline_sentiment | airline_sentiment_confidence | negativereason | negativereason_confidence | airline | airline_sentiment_gold | name | negativereason_gold | retweet_count | text | tweet_coord | tweet_created | tweet_location | user_timezone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 570306133677760513 | neutral | 1.0 | | | Virgin America | | cairdin | | 0 | @VirginAmerica What @dhepburn said. | | 2015-02-24 11:35:52 -0800 | | Eastern Time (US & Canada) |
| 570301130888122368 | positive | 0.3486 | | 0.0 | Virgin America | | jnardino | | 0 | @VirginAmerica plus you've added commercials t... | | 2015-02-24 11:15:59 -0800 | | Pacific Time (US & Canada) |
| 570301083672813571 | neutral | 0.6837 | | | Virgin America | | yvonnalynn | | 0 | @VirginAmerica I didn't today... Must mean I n... | | 2015-02-24 11:15:48 -0800 | Lets Play | Central Time (US & Canada) |


# Representing Texts


### Bag of Words

In [6]:
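# Unigram counts; min_df=100 keeps only terms that appear in at least 100 tweets.
vectorizer = CountVectorizer(ngram_range=(1, 1), min_df=100)
vectorizer.fit(df["text"])
documents = vectorizer.transform(df["text"])
# Query terms that are not in the fitted vocabulary are ignored by transform.
query = vectorizer.transform(['bad catering service'])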
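Because `min_df` removes rare terms from the vocabulary, a query word can silently vanish from the query vector. The sketch below shows how to check this on a hypothetical toy corpus, with `min_df=2` standing in for the notebook's `min_df=100`. Whether `catering` survives the threshold on the real tweets is not shown in this notebook; the same check on the fitted `vectorizer` would answer that.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus (assumption: not part of the tweet dataset).
toy_tweets = [
    "bad service on my flight",
    "bad delay and bad service",
    "the catering was cold",
    "service was fine",
]

# min_df=2 keeps only terms that occur in at least two texts.
toy_vectorizer = CountVectorizer(min_df=2)
toy_vectorizer.fit(toy_tweets)

# Tokenize the query exactly as the vectorizer would.
analyzer = toy_vectorizer.build_analyzer()
query_terms = analyzer("bad catering service")

# Split the query terms by whether they are in the fitted vocabulary.
kept = [t for t in query_terms if t in toy_vectorizer.vocabulary_]
dropped = [t for t in query_terms if t not in toy_vectorizer.vocabulary_]
print("kept:", kept)
print("dropped:", dropped)

# The query vector only has entries for the kept terms.
toy_query = toy_vectorizer.transform(["bad catering service"])
print("non-zero query entries:", toy_query.nnz)
```

If the topic word itself is dropped, the ranking is driven by the remaining words only, so lowering `min_df` may be worth trying.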


### Bag of N-Grams

In [8]:
# Counts of unigrams, bigrams, and trigrams; min_df=100 applies to every n-gram.
vectorizer = CountVectorizer(ngram_range=(1, 3), min_df=100)
vectorizer.fit(df["text"])
documents = vectorizer.transform(df["text"])
query = vectorizer.transform(['bad catering service'])

### TF-IDF

In [9]:
# TF-IDF weights over unigrams (the default ngram_range), with the same min_df=100.
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(min_df=100)
vectorizer.fit(df["text"])
documents = vectorizer.transform(df["text"])
query = vectorizer.transform(['bad catering service'])

### Ranking Documents

In [10]:
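# cosine_similarity returns a (1, n_tweets) matrix; take its only row.
similarities = sklearn.metrics.pairwise.cosine_similarity(query, documents)[0]
result = pd.DataFrame({"Tweets": df["text"], "Similarity": similarities})
# Show the ten tweets most similar to the query.
result.sort_values(by="Similarity", ascending=False).head(10)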

| tweet_id | Tweets | Similarity |
|---|---|---|
| 567859888480772096 | @USAirways bad weather shouldn't mean bad service | 0.86744 |
| 568781427636092928 | @USAirways has completely wasted my work day! ... | 0.691778 |
| 570042096242917377 | @AmericanAir SO BAD service in Miami, AirPort.. | 0.673601 |
| 568478536715120640 | @SouthwestAir following!! My bad. | 0.670176 |
| 568993773277069312 | @JetBlue oh. Makes sense. My bad. | 0.666328 |
| 568212893012860928 | @SouthwestAir I receive bad customer service a... | 0.640216 |
| 569035887956271104 | @united bad customer service to NYC a few wee... | 0.612229 |
| 568491905903939584 | @USAirways your app is bad, and you should fee... | 0.610956 |
| 570072227942674432 | @USAirways cust svc means nothing! So disappoi... | 0.605915 |
| 568779311261622272 | @USAirways that the flight was delayed &amp; c... | 0.554856 |
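As the cells are written, each representation overwrites `vectorizer`, `documents`, and `query`, so the ranking cell uses whichever representation ran last. One way to compare all three side by side is a small helper that takes the vectorizer as an argument. The function `rank_texts` and the toy corpus below are hypothetical, and the vectorizers use the default `min_df=1` because the toy corpus is tiny.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def rank_texts(texts, query, vectorizer, top_n=10):
    """Hypothetical helper: rank texts by cosine similarity to a query."""
    texts = pd.Series(texts)
    documents = vectorizer.fit_transform(texts)
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, documents)[0]
    result = pd.DataFrame(
        {"Tweets": texts.values, "Similarity": scores}, index=texts.index
    )
    return result.sort_values(by="Similarity", ascending=False).head(top_n)


# Hypothetical toy corpus (assumption: not part of the tweet dataset).
toy_tweets = [
    "bad catering service on this flight",
    "my bad, wrong airline",
    "great service and friendly crew",
    "bad weather delayed the flight",
]

vectorizers = {
    "bag of words": CountVectorizer(ngram_range=(1, 1)),
    "bag of n-grams": CountVectorizer(ngram_range=(1, 3)),
    "tf-idf": TfidfVectorizer(),
}

# One ranking per representation, keyed by its name.
rankings = {
    name: rank_texts(toy_tweets, "bad catering service", vec, top_n=2)
    for name, vec in vectorizers.items()
}
for name, ranking in rankings.items():
    print(name)
    print(ranking, end="\n\n")
```

On the real data, the same helper could be called with `df["text"]` and vectorizers configured with `min_df=100`; passing a Series keeps `tweet_id` as the index of the result.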
