# Information Retrieval

In this workbook, we will use the vector space model and tf-idf to build a system that is able to query a corpus of tweets.

First, let's read in the data set. The relevant columns for us are `airline` (the airline mentioned in the tweet) and `text` (the tweet itself).

In [5]:
import numpy as np
import pandas as pd

data = pd.read_csv("/data/airline-tweets.csv")
data.head(n=3)

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)


**Step 1.** Write some code to convert each tweet to a list of words. You will need to normalize this list of words. Also, make sure you remove hashtags (words beginning with `#`) and mentions (words beginning with `@`).

In [6]:
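def get_normalized_words(text):
    """Return the lowercased, purely alphabetic words of a tweet."""
    normal = []
    for word in text.lower().split():
        # Skip hashtags and mentions.
        if word[0] == "#" or word[0] == "@":
            continue
        # Strip trailing punctuation (the text is already lowercase).
        normalized_word = word.rstrip(",.?!")
        # Keep only tokens made up entirely of letters. Note that all() is
        # True for an empty string, so punctuation-only tokens such as "!!!"
        # are kept as "".
        if all(char.isalpha() for char in normalized_word):
            normal.append(normalized_word)
    return normal

documents = data["text"].apply(get_normalized_words)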

In [7]:
documents.head()

0                                         [what, said]
1    [plus, added, commercials, to, the, experience...
2    [i, today, must, mean, i, need, to, take, anot...
3    [really, aggressive, to, blast, obnoxious, in,...
4         [and, a, really, big, bad, thing, about, it]
Name: text, dtype: object
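One thing to watch in a tokenizer like the one above: `all(...)` over an empty string is `True`, so a token that consists only of trailing punctuation (for example `"!!!"`) ends up in the word list as `""`. The sketch below is one possible variant that discards such tokens by using `str.isalpha()`, which is `False` for the empty string. The name `tokenize` is hypothetical and is not part of the notebook; swapping it in would change the vocabulary and the outputs shown in later cells.

```python
def tokenize(text):
    """Lowercase a tweet, drop #hashtags and @mentions, strip trailing
    punctuation, and keep only non-empty, purely alphabetic tokens."""
    words = []
    for token in text.lower().split():
        if token.startswith(("#", "@")):
            continue
        token = token.rstrip(",.?!")
        # str.isalpha() is False for "", so punctuation-only tokens are dropped.
        if token.isalpha():
            words.append(token)
    return words


print(tokenize("@united Worst flight ever!!! #fail So late ..."))
# ['worst', 'flight', 'ever', 'so', 'late']
```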

**Step 2.** The vector space model represents a text by a vector of word counts. But to do this, we need to know what all the possible words are. Determine the **vocabulary**, the set of unique words that appear in this corpus. Throw away words that do not appear at least 10 times in the corpus.

In [3]:
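# Count how often each word occurs across the whole corpus.
word_counts = {}
for doc in documents:
    for word in doc:
        if word not in word_counts:
            word_counts[word] = 1
        else:
            word_counts[word] += 1

In [4]:
# Keep only the words that appear at least 10 times.
vocab = [word for word, count in word_counts.items() if count >= 10]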

In [5]:
len(vocab)

1725
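The same counting step can also be sketched with `collections.Counter` from the standard library. This is only an illustration on a made-up toy corpus; `build_vocabulary` and `toy_docs` are hypothetical names, and the result is sorted purely to make the output deterministic.

```python
from collections import Counter


def build_vocabulary(docs, min_count):
    """Return the sorted words that occur at least min_count times in docs."""
    counts = Counter(word for doc in docs for word in doc)
    return sorted(word for word, n in counts.items() if n >= min_count)


toy_docs = [
    ["late", "flight"],
    ["flight", "cancelled"],
    ["late", "flight", "again"],
]
print(build_vocabulary(toy_docs, min_count=2))
# ['flight', 'late']
```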

**STOP! BEFORE MOVING ON, MAKE SURE THERE ARE FEWER THAN 2000 WORDS IN YOUR VOCABULARY. IF THERE ARE MORE, INCREASE THE MINIMUM COUNT FOR A WORD TO APPEAR IN THE VOCABULARY.**

**Step 3.** Write a function that takes in a list of words and returns a Pandas series representing how many times each word in our vocabulary appears in the list.

In [7]:
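def convert_words_to_vector(words):
    # Start from a zero count for every vocabulary word.
    vector = pd.Series(0, index=vocab, dtype=int)
    for word in words:
        if word in vocab:
            vector[word] += 1
    return vector

# Store the counts sparsely, since most entries are zero.
# DataFrame.to_sparse() was removed in pandas 1.0; a sparse dtype replaces it.
word_vectors = documents.apply(convert_words_to_vector).astype(
    pd.SparseDtype(int, fill_value=0)
)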

In [10]:
word_vectors.head()

Unnamed: 0,Unnamed: 1,sky,leaving,know,plus,blue,entire,girl,this,address,...,wont,print,worries,direct,stewardess,send,works,outside,booked,little
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


**Step 4.** Convert the term frequencies to tf-idf.

In [11]:
# idf = log(number of documents / number of documents containing the word)
idf = np.log(len(word_vectors) / (word_vectors > 0).sum())
tf_idf = word_vectors * idf
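To see what this weighting does, here is a small worked example on made-up counts, using the same idf definition as the cell above: the log of the number of documents divided by the number of documents that contain the word. The names `toy_tf`, `toy_idf` and `toy_tf_idf` are hypothetical. A word that appears in every document gets an idf of 0, so it contributes nothing to the tf-idf vectors.

```python
import numpy as np
import pandas as pd

# Rows are documents, columns are vocabulary words, values are raw counts.
toy_tf = pd.DataFrame({
    "flight": [1, 2, 1],
    "late": [1, 0, 0],
    "bag": [0, 0, 3],
})

n_docs = len(toy_tf)
doc_freq = (toy_tf > 0).sum()        # flight: 3, late: 1, bag: 1
toy_idf = np.log(n_docs / doc_freq)  # flight: log(3/3) = 0, others: log(3)

# Multiplying a DataFrame by a Series aligns the Series index with the columns.
toy_tf_idf = toy_tf * toy_idf

print(toy_idf)
print(toy_tf_idf)
```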

**Step 5.** Find the 10 tweets in the corpus that are closest (as measured by cosine similarity) to the headline "Passenger Is Dragged From an Overbooked Flight." What airline is represented most often in these tweets?

In [12]:
query_word = get_normalized_words("Passenger Is Dragged From an Overbooked Flight")
query_vector = convert_words_to_vector(query_word) * idf

In [13]:
# Cosine similarity = dot product / (length of tweet vector * length of query vector)
dot_product = (tf_idf * query_vector).sum(axis=1)
length1 = np.sqrt((tf_idf ** 2).sum(axis=1))
length2 = np.sqrt((query_vector ** 2).sum())

In [14]:
cos_sim = dot_product / (length1 * length2)

In [15]:
cos_sim.sort_values(ascending=False).head(5)

1985     0.484210
5074     0.449461
5028     0.442914
5018     0.405606
10622    0.401838
dtype: float64
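A tweet with no vocabulary words has a tf-idf vector of length zero, so the division above would give 0/0, which is `NaN`, for that row. Since `sort_values` places `NaN` last by default, this should not affect the top results, but it can be made explicit. The following is a minimal NumPy sketch on toy vectors, assuming a similarity of 0 is acceptable for all-zero rows; `cosine_similarities` is a hypothetical helper, not part of the notebook.

```python
import numpy as np


def cosine_similarities(doc_matrix, query):
    """Cosine similarity between each row of doc_matrix and query.

    Rows (or a query) with zero length get a similarity of 0 instead of NaN.
    """
    doc_matrix = np.asarray(doc_matrix, dtype=float)
    query = np.asarray(query, dtype=float)
    dots = doc_matrix @ query
    norms = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query)
    sims = np.zeros(len(doc_matrix))
    # Divide only where the denominator is non-zero; other entries stay 0.
    np.divide(dots, norms, out=sims, where=norms > 0)
    return sims


toy_docs_matrix = [[1, 0, 0], [1, 1, 0], [0, 0, 0]]
print(cosine_similarities(toy_docs_matrix, [1, 0, 0]))
```

The first row points in the same direction as the query (similarity 1), the second is at 45 degrees to it (about 0.707), and the all-zero row gets 0.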

In [16]:
top10=cos_sim.sort_values(ascending=False).index[:10]
for tweet in data.loc[top10, 'text']:
    print(tweet)

@united overbooked by FIFTY people?!? the worst.
@SouthwestAir flying flight 3130 tomorrow at 7:20 from PBI- I have boarding position C-42. Is flight overbooked? Have funeral to attend!
@SouthwestAir flying flight 3130 tonight at 7:20 from PBI- I have boarding position C-42. Is flight overbooked? Have funeral to attend!
@SouthwestAir flying flight 3130 tonight at 7:20 from PBI- I have boarding position C-42. Is it overbooked? Really don't want to be bumped!
@USAirways bumping people off a flight ten minutes before takeoff because the flight is overbooked #fail
@United is an airline where you pay extra to get a better seat but by the time you board your overbooked flight, there's no overhead space.
@united that is not in line with your responses here. And now I'm waiting until tomorrow morning because all the flights are overbooked.
@AmericanAir Still waiting on bags from flight 1613/2440 yesterday  First Class passenger not happy with your service.
@united - I think she was having a ro

In [18]:
top10

Int64Index([1985, 5074, 5028, 5018, 10622, 1734, 934, 14390, 4050, 1786], dtype='int64')
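The last part of Step 5 asks which airline is represented most often among these tweets. One way to answer it is to look up the `airline` column for the selected rows and count the values. The sketch below shows the idea on a made-up toy frame; `airline_counts` and `toy` are hypothetical names, and no result for the real data set is claimed here.

```python
import pandas as pd


def airline_counts(frame, row_labels):
    """Count how often each airline occurs among the rows in row_labels."""
    return frame.loc[row_labels, "airline"].value_counts()


toy = pd.DataFrame(
    {
        "airline": ["United", "Southwest", "United", "US Airways"],
        "text": ["a", "b", "c", "d"],
    },
    index=[10, 20, 30, 40],
)

counts = airline_counts(toy, [10, 30, 40])
print(counts)
print(counts.idxmax())  # United
```

In the notebook itself, the equivalent call would be `data.loc[top10, "airline"].value_counts()`.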