# Basic Text Representation
The Twitter dataset (`tweets.csv`) was scraped in February 2015. Contributors were asked to first classify each tweet as positive, negative, or neutral, and then to categorize the reason behind each negative tweet (such as "late flight" or "rude service"). The dataset is available [on Kaggle](https://www.kaggle.com/crowdflower/twitter-airline-sentiment).

We are looking for all tweets related to `bad catering service`. Use basic vectorization approaches and a similarity measure such as cosine similarity to rank the tweets by their closeness to this topic. Consult the scikit-learn documentation for details.
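
Before turning to the real data, the overall idea (vectorize the tweets, vectorize the query with the same vocabulary, rank by cosine similarity) can be sketched on a tiny corpus. The three sentences and the `toy_*` names below are made up for illustration and are not taken from `tweets.csv`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus, not taken from tweets.csv.
toy_tweets = [
    "the food was bad and the service was slow",
    "great crew and a smooth flight",
    "bad service again",
]

toy_vectorizer = CountVectorizer()
toy_documents = toy_vectorizer.fit_transform(toy_tweets)  # one row per tweet
toy_query = toy_vectorizer.transform(["bad service"])     # same columns as toy_documents

# One similarity score per tweet; higher means closer to the query.
toy_scores = cosine_similarity(toy_query, toy_documents)[0]
toy_ranking = toy_scores.argsort()[::-1]                   # tweet indices, best match first

for i in toy_ranking:
    print(round(float(toy_scores[i]), 3), toy_tweets[i])
```

The short tweet that consists almost entirely of query words ranks first, because cosine similarity normalizes by vector length and so penalizes the extra words in the longer tweet.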

## Importing Modules

In [36]:
import pandas
import sklearn.feature_extraction.text
import sklearn.metrics.pairwise

# Show full tweet text instead of truncating long strings in DataFrame output.
pandas.set_option("display.max_colwidth", None)

## Loading the Dataset

In [37]:
df = pandas.read_csv("../../datasets/tweets.csv")
df.head()

| | tweet_id | airline_sentiment | airline_sentiment_confidence | negativereason | negativereason_confidence | airline | airline_sentiment_gold | name | negativereason_gold | retweet_count | text | tweet_coord | tweet_created | tweet_location | user_timezone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 570306133677760513 | neutral | 1.0 | | | Virgin America | | cairdin | | 0 | @VirginAmerica What @dhepburn said. | | 2015-02-24 11:35:52 -0800 | | Eastern Time (US & Canada) |
| 1 | 570301130888122368 | positive | 0.3486 | | 0.0 | Virgin America | | jnardino | | 0 | @VirginAmerica plus you've added commercials to the experience... tacky. | | 2015-02-24 11:15:59 -0800 | | Pacific Time (US & Canada) |
| 2 | 570301083672813571 | neutral | 0.6837 | | | Virgin America | | yvonnalynn | | 0 | @VirginAmerica I didn't today... Must mean I need to take another trip! | | 2015-02-24 11:15:48 -0800 | Lets Play | Central Time (US & Canada) |
| 3 | 570301031407624196 | negative | 1.0 | Bad Flight | 0.7033 | Virgin America | | jnardino | | 0 | @VirginAmerica it's really aggressive to blast obnoxious "entertainment" in your guests' faces &amp; they have little recourse | | 2015-02-24 11:15:36 -0800 | | Pacific Time (US & Canada) |
| 4 | 570300817074462722 | negative | 1.0 | Can't Tell | 1.0 | Virgin America | | jnardino | | 0 | @VirginAmerica and it's a really big bad thing about it | | 2015-02-24 11:14:45 -0800 | | Pacific Time (US & Canada) |


## Representing Texts

### Bag of Words

In [25]:
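# Unigram counts; min_df=100 keeps only terms that occur in at least 100 tweets.
vectorizer = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(1, 1), min_df=100)
documents = vectorizer.fit_transform(df["text"])
# Map the query onto the same vocabulary; terms outside the vocabulary are ignored.
query = vectorizer.transform(["bad catering service"])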
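
Because `min_df` prunes rare terms, a query word can silently vanish from the query vector. The sketch below illustrates this on a made-up corpus, with `min_df=2` standing in for `min_df=100`. Whether "catering" actually survives pruning in `tweets.csv` is not shown in this notebook; on the real data it could be checked with `"catering" in vectorizer.vocabulary_`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus; min_df=2 plays the role of min_df=100 above.
toy_tweets = [
    "bad service on my flight",
    "bad weather delayed my flight",
    "the catering was cold",
    "service was friendly",
]

toy_vectorizer = CountVectorizer(min_df=2)
toy_vectorizer.fit(toy_tweets)

# Terms kept in the vocabulary: each appears in at least 2 toy tweets.
toy_vocabulary = sorted(toy_vectorizer.vocabulary_)
print(toy_vocabulary)

# "catering" appears in only one toy tweet, so it is pruned and the query
# vector has non-zero entries only for "bad" and "service".
toy_query = toy_vectorizer.transform(["bad catering service"])
kept_terms = [term for term in ["bad", "catering", "service"]
              if term in toy_vectorizer.vocabulary_]
print(kept_terms, toy_query.nnz)
```

When a query term is dropped this way, the ranking is effectively computed for the remaining terms only, so it is worth inspecting the vocabulary before interpreting the results.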

### Bag of N-Grams

In [22]:
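# Counts of unigrams, bigrams, and trigrams; min_df=100 still applies to every n-gram.
vectorizer = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(1, 3), min_df=100)
documents = vectorizer.fit_transform(df["text"])
# Map the query onto the same vocabulary; n-grams outside the vocabulary are ignored.
query = vectorizer.transform(["bad catering service"])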
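
To see what `ngram_range=(1, 3)` adds, the sketch below fits a vectorizer on the query string alone and lists the resulting features. The `toy_*` names are illustrative, and no `min_df` is applied here so that every n-gram is kept.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Fit on a single string just to inspect which n-grams are generated.
toy_vectorizer = CountVectorizer(ngram_range=(1, 3))
toy_vectorizer.fit(["bad catering service"])

# vocabulary_ maps each n-gram to its column index.
toy_features = sorted(toy_vectorizer.vocabulary_)
print(toy_features)
```

Three words yield six features: three unigrams, two bigrams, and one trigram. On the real data, a longer n-gram only becomes a feature if it also meets the `min_df` threshold.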

### TF-IDF

In [38]:
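# TF-IDF weights over unigrams; terms found in many tweets are down-weighted.
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(min_df=100)
documents = vectorizer.fit_transform(df["text"])
# Map the query onto the same vocabulary and apply the fitted IDF weights.
query = vectorizer.transform(["bad catering service"])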
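
The effect of the IDF weighting can be inspected through the fitted `idf_` attribute. The following is a minimal sketch on a made-up three-tweet corpus, assuming scikit-learn's default settings (`smooth_idf=True`), under which the weight of a term is `ln((1 + n) / (1 + df)) + 1` for `n` documents and document frequency `df`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "bad" occurs in every tweet, the other words in one each.
toy_tweets = [
    "bad service",
    "bad flight",
    "bad catering",
]

toy_vectorizer = TfidfVectorizer()
toy_documents = toy_vectorizer.fit_transform(toy_tweets)

# Pair each term with its IDF weight via the term -> column index mapping.
toy_idf = {term: float(toy_vectorizer.idf_[index])
           for term, index in toy_vectorizer.vocabulary_.items()}

for term, weight in sorted(toy_idf.items()):
    print(term, round(weight, 3))
```

The word that appears in every tweet receives the smallest weight, while words confined to a single tweet receive larger ones, so rare query terms contribute more to the similarity score than common ones.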

## Ranking Documents

In [42]:
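# `query` and `documents` come from whichever representation cell ran last.
# cosine_similarity returns a (1, n_tweets) matrix; take its only row.
similarities = sklearn.metrics.pairwise.cosine_similarity(query, documents)[0]
result = pandas.DataFrame({"Tweets": df["text"], "Similarity": similarities})
# Show the ten tweets most similar to the query.
result.sort_values(by="Similarity", ascending=False).head(10)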

| | Tweets | Similarity |
|---|---|---|
| 11635 | @USAirways bad weather shouldn't mean bad service | 0.86744 |
| 10923 | @USAirways has completely wasted my work day! Thank you for the bad service and bad Friday! | 0.691778 |
| 12737 | @AmericanAir SO BAD service in Miami, AirPort.. | 0.673601 |
| 5932 | @SouthwestAir following!! My bad. | 0.670176 |
| 7909 | @JetBlue oh. Makes sense. My bad. | 0.666328 |
| 6116 | @SouthwestAir I receive bad customer service and ended up spending several hundred dollars to accommodate my family during each cxl flight | 0.640216 |
| 2571 | @united bad customer service to NYC a few weeks ago. Thinking of moving on | 0.612229 |
| 11123 | @USAirways your app is bad, and you should feel bad. | 0.610956 |
| 9199 | @USAirways cust svc means nothing! So disappointed. Trying since 730a to speak to a human I get bad weather not bad service.\n#socialtantrum | 0.605915 |
| 10928 | @USAirways that the flight was delayed &amp; closed the doors only to tell us we would not be leaving for an hour and a half. Bad, bad service | 0.554856 |
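
Each representation cell overwrites `vectorizer`, `documents`, and `query`, so the ranking reflects only the cell that ran last. Judging by the execution counters, that appears to be the TF-IDF cell. One way to compare all three representations side by side is to wrap the steps in a function. The sketch below does this on a made-up corpus; `rank_tweets` and the `toy_*` names are hypothetical and not part of the original notebook.

```python
import pandas
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def rank_tweets(vectorizer, texts, query_text, top_n=10):
    """Hypothetical helper: fit `vectorizer` on `texts`, then rank them against `query_text`."""
    documents = vectorizer.fit_transform(texts)
    query = vectorizer.transform([query_text])
    similarities = cosine_similarity(query, documents)[0]
    result = pandas.DataFrame({"Tweets": list(texts), "Similarity": similarities})
    return result.sort_values(by="Similarity", ascending=False).head(top_n)


# Hypothetical toy corpus, not taken from tweets.csv.
toy_tweets = [
    "the catering service was bad",
    "bad weather, good service",
    "my flight was on time",
    "my bad",
]

# No min_df here, since the toy corpus is far too small for pruning.
toy_vectorizers = {
    "bag of words": CountVectorizer(ngram_range=(1, 1)),
    "bag of n-grams": CountVectorizer(ngram_range=(1, 3)),
    "tf-idf": TfidfVectorizer(),
}

toy_rankings = {
    name: rank_tweets(toy_vectorizer, toy_tweets, "bad catering service", top_n=3)
    for name, toy_vectorizer in toy_vectorizers.items()
}

for name, ranking in toy_rankings.items():
    print(name)
    print(ranking)
```

On the real data, the same function could be called with `df["text"]` and the vectorizer settings from the cells above, which avoids depending on the order in which the cells were run.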
