## This is an individual assignment. Do not work in groups, and do not consult each other when doing this midterm.

# Task 1 - Getting tweets (3 points)

Pick two Twitter usernames. These usernames should be "oppositional" in nature. I'm leaving the exact meaning of the term "oppositional" intentionally vague; the two usernames you pick should be in the same domain (politics, products, pop culture, etc.) but should also be strong contrasts or even rivals. For example: @Apple vs. @SamsungMobile, @SenMajLdr vs. @SenSchumer, @CocaCola vs. @pepsi, @OfficialKanye vs. @taylorswift13. Basically, if the two handles you pick can sensibly fit into the phrase "A vs. B," then they're oppositional.

In the code block below, use the Twitter API to get **the text** of 500 tweets from each of the two usernames you pick - specifically, each user's 500 most recent tweets. Please pick users that tweet primarily in English. This "English" requirement is simply so the two instructors can go through the information. This means you'll have a total of 1000 tweets.

**Do NOT pick any of the examples I gave above as your two oppositional usernames. Think up an original pair. Note that this task requires you to come up with two "oppositional" entities as well as to get those entities' most recent tweets. Keep in mind that this is an individual assignment. Any two students who happen to pick the same two oppositional entities may be double-checked for cheating; therefore, the more "original" your oppositional pair, the better. Any two students who have exactly the same set of tweets will draw a high amount of suspicion, since it is highly unlikely that students will a) pick the same entities and b) get their tweets at exactly the same time.**

In [8]:
#Save twitter credentials in variables over here
API_KEY = ""
API_SECRET = ""
import tweepy
auth = tweepy.AppAuthHandler(API_KEY, API_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
#fetching tweets for both Southwest and United airlines from their user timelines
user1_tweets= tweepy.Cursor(api.user_timeline, id="SouthwestAir")
user2_tweets = tweepy.Cursor(api.user_timeline, id="united")
user_tweetstore = []
for status1,status2 in zip(user1_tweets.items(500),user2_tweets.items(500)):
    user_tweetstore.append((status1.text,0))
    user_tweetstore.append((status2.text,1))

## Task 1 Rubric

- code to obtain Tweets runs without errors (1 point)
- code correctly written to obtain the latest 500 tweets from a pair of usernames (1 point)
- student REMOVES their API key and secret before submitting the midterm!!! (1 point)

# Task 2 - Saving your data to an external file (3 points)

In the code block below, consolidate your tweets into a single variable. This variable should have two "columns," one for the text of the tweets, the other a binary indicator of the source (e.g. 0 for the first source, 1 for the second). You can use a list of tuples or a pandas DataFrame, and save as either a pickle (.pkl) or a comma-separated values file (.csv).

In [9]:
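import pandas as pd
import pickle as pkl
#Converting tweets from both the airlines into a dataframe with 0 for SouthwestAir and 1 for united.
Airline_Tweets = pd.DataFrame(user_tweetstore,columns=["Tweet","Category of Airlines"])
#Saving the dataframe to a pickle file; the with block closes the file handle
with open("AirlineTweets.pkl", "wb") as f:
    pkl.dump(Airline_Tweets, f)
#Loading the file back to confirm that it was saved correctly
with open("AirlineTweets.pkl", "rb") as f:
    loaded_tweets = pkl.load(f)
loaded_tweets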

| | Tweet | Category of Airlines |
|---|---|---|
| 0 | @Morb1dlyObtuse Hey, there. Could you DM a scr... | 0 |
| 1 | @HappyYugina Hello, please DM your confirmatio... | 1 |
| 2 | @kat_bates We don't like to hear of your conti... | 0 |
| 3 | @MSIDarbs Have a great trip, David! ^EM | 1 |
| 4 | @erik_pederson We know delays are no fun, Erik... | 0 |
| 5 | @DJENNINGS15 Hi there. Yes, that's correct. ^CW | 1 |
| 6 | @Matt4Music Woohoo! That's what we love to hea... | 0 |
| 7 | @mr_audrey Hi Kate have you since been reunite... | 1 |
| 8 | @itstreverr Hey, Trever. We appreciate your fe... | 0 |
| 9 | @rcrain Hello Rhiannon, let's look at your res... | 1 |
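Task 2 also allows a .csv file instead of a pickle. The following is a minimal sketch of that alternative, not part of the submitted solution. The `tweet_store` list and the file name are hypothetical stand-ins for the real data.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the collected (text, label) tuples.
tweet_store = [
    ("Thanks for flying with us!", 0),
    ("Please DM your confirmation number so we can take a look.", 1),
]

df = pd.DataFrame(tweet_store, columns=["Tweet", "Category of Airlines"])

# Write to a temporary directory so the sketch does not clutter the working folder.
csv_path = os.path.join(tempfile.mkdtemp(), "tweets_example.csv")

# index=False keeps the row index out of the file; without it, reading the
# file back produces an extra "Unnamed: 0" column.
df.to_csv(csv_path, index=False)

restored = pd.read_csv(csv_path)
print(restored)
```

Unlike a pickle, the CSV file can be opened in a text editor or spreadsheet, which makes it easier to inspect by hand.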


## Task 2 Rubric

- code to save data to external file runs without errors (1 point)
- saved data formatted correctly into "2 columns" (1 point)
- external file submitted with midterm (1 point)

# Task 3 - Preparing data for an sklearn binary classifier (3 points)

In the code block below, create two variables, X, and y. The y variable should be simple - it is simply the "second column" of the data you made in task 2, a binary indicator of source, with 0 representing one source and 1 representing the other.

For the X variable, choose either `TfidfVectorizer` or `CountVectorizer` from `sklearn.feature_extraction.text` to turn the raw text (column 1 from task 2) into a "bag-of-words" representation. When instantiating your vectorizer, set the argument `lowercase=True`, to ensure that all words are lowercased, and `stop_words="english"`, to remove English stop words.

Additionally, when instantiating the vectorizer, pass the `max_df=???` and `min_df=???` arguments. These arguments can take either *a float between 0.0 and 1.0 or an integer*. The df stands for "document frequency." These arguments tell the vectorizer to remove words that occur in *more than* (max_df) or *fewer than* (min_df) a certain number of documents. This will remove frequent words - which show up all the time and therefore are not informative - and infrequent words, which are so rare as to just be noise. If you pass these arguments a float, that float represents the proportion of documents (e.g. `max_df=0.9` means, remove all words that show up in more than 90% of the tweets) and if you pass these arguments an integer, that integer represents the number of documents (e.g. `min_df=5`, remove all words that show up in fewer than 5 documents).
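To make the effect of the two arguments concrete, here is a small sketch on a made-up four-document corpus (the documents and thresholds are assumptions chosen for illustration). The word "flight" appears in every document, so `max_df=0.9` drops it, while words that appear in only one document fall below `min_df=2`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for this illustration.
docs = [
    "flight delayed again",
    "flight on time",
    "flight cancelled today",
    "flight delayed today",
]

# max_df=0.9: drop words found in more than 90% of documents.
# min_df=2:   drop words found in fewer than 2 documents.
vec = CountVectorizer(lowercase=True, max_df=0.9, min_df=2)
X_toy = vec.fit_transform(docs)

kept_words = sorted(vec.vocabulary_)
print(kept_words)
print(X_toy.shape)
```

Only "delayed" and "today" survive, so the resulting matrix has 4 rows and 2 columns.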

By the end of this task, you should have a variable X, of dimensionality $n \times d$ where $n = 1000$ and $d$ is the number of words left after the vectorizer considers df, and you should have variable y, which is a vector of length 1000, with 1s or 0s representing tweet source.

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer(stop_words='english',lowercase=True,max_df=0.8,min_df=3)
X = tv.fit_transform(Airline_Tweets["Tweet"])
y=Airline_Tweets["Category of Airlines"]

## Task 3 Rubric

- Code runs without errors (1 point)
- Written code correctly achieves objective of creating X, y variables for classifier (1 point)
- All required arguments to vectorizer included (1 point)


# Task 4 - Training a Logistic Regression classifier (3 points)

Instantiate an sklearn Logistic Regression binary classifier ([sklearn documentation here](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)). 

Then, use `cross_val_score` from `sklearn.model_selection` ([sklearn documentation here](http://scikit-learn.org/stable/modules/cross_validation.html#computing-cross-validated-metrics)) to perform 5-fold cross validation. The inputs to `cross_val_score` will be your instantiated Logistic Regression classifier, X, y, and a named argument `cv=5` to indicate the number of folds. The output will be an array of 5 numbers - the accuracy from each fold.

Print the average of those 5 numbers. This will be the mean 5-fold cross validation accuracy of your classifier.
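As a rough illustration of the call pattern described above, the sketch below runs 5-fold cross validation on synthetic data. The data from `make_classification` and the names `X_demo`, `y_demo`, and `clf_demo` are assumptions standing in for the vectorized tweets and the real classifier.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the X and y built in Task 3.
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

clf_demo = LogisticRegression(max_iter=1000)

# One accuracy value per fold, returned as a NumPy array of length 5.
fold_scores = cross_val_score(clf_demo, X_demo, y_demo, cv=5)
mean_accuracy = fold_scores.mean()

print(fold_scores)
print(mean_accuracy)
```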

In [11]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
#using logistic regression
clf = LogisticRegression()
#performing crossvalidation
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())

0.92


## Task 4 Rubric

- Code runs without errors (1 point)
- Code successfully creates a Logistic Regression classifier and runs cross validation (1 point)
- Code prints mean cross validation accuracy (1 point)

# Task 5 - Discussion

Answer the following questions.

1 Since you pulled equal amounts of tweets from each source, the baseline accuracy is 50%. This is the accuracy we would expect from a classifier that guessed 0 or 1 randomly, or a classifier that simply guessed all 0s or all 1s. Your classifier either did well or did poorly. In either case, think about the *actual content* of the sources you picked and in the text block below, informally share your thoughts on why your classifier did poorly/well. (1 point)

Ans. The classifier performed well because we removed noise from the data, such as stop words and the most and least common words, keeping only the relevant features that contribute significantly to the predictions.
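The 50% baseline mentioned in question 1 can be checked directly. This is a minimal sketch assuming a perfectly balanced label vector of 500 zeros and 500 ones; the placeholder feature matrix is ignored by the dummy model.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Balanced labels: 500 tweets from each source.
y_balanced = np.array([0] * 500 + [1] * 500)

# DummyClassifier ignores the features, so a column of zeros is enough.
X_placeholder = np.zeros((len(y_balanced), 1))

# "most_frequent" always predicts a single class, i.e. all 0s or all 1s.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_balanced)
baseline_accuracy = baseline.score(X_placeholder, y_balanced)

print(baseline_accuracy)
```

Any cross-validated accuracy well above this value indicates that the classifier has learned something from the text.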

2 What could you have done differently when preprocessing your data (task 3) to try and improve your classifier's accuracy? (1 point)

Ans. We can fine-tune the `max_df` and `min_df` parameters while preprocessing the data to get better classifier accuracy. We can also check stop words for other languages to further remove noise from the data.

3 What parameters could you have adjusted in the Logistic Regression classifier in Task 4 to "tune" it and get better performance? What other binary classifiers could you have used, and what "tune-able" parameters do those classifiers have? (1 point)

Ans. One potential way to further improve the accuracy of the model is to tune the parameter `C`, a positive number that is the inverse of the regularization strength, i.e. the lower the `C`, the stronger the regularization [1].
We could also have used a Support Vector Machine, which has tunable parameters such as `C` (the penalty parameter of the error term) and the kernel type, which is 'rbf' by default but can be set to other types as well [2].
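One way to tune `C` is a small grid search with cross validation. The sketch below is illustrative only: the synthetic data and the candidate values in `param_grid` are assumptions, not values used in this midterm.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the vectorized tweets.
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

# Candidate regularization settings; smaller C means stronger regularization.
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

The same pattern would apply to `sklearn.svm.SVC` by swapping the estimator and adding a `"kernel"` entry to the grid.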
# Reference:
[1] http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

[2] http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC

# Bonus (3 points)

This is an opportunity to gain an additional 3 points above the 15 allocated for the midterm. This will require you to do something that wasn't covered directly in the lectures, but can somewhat easily be learned by going through the sklearn documentation.

Note 1: When a count vectorizer or TF-IDF vectorizer is instantiated and used to transform your raw text data, it builds a dictionary that indicates which word is assigned to which index. Remember that it produces an $n \times d$ matrix, where $n$ is the number of samples and $d$ is the number of words. If you want to know the index of a word (that is, which column in $d$ corresponds to that word), you can consult this dictionary. Suppose you named your vectorizer `vec`. To access this dictionary, use `vec.vocabulary_`. If you want to know the index of the word `banana`, access `vec.vocabulary_['banana']`. 

Note 2: When you instantiate and train a logistic regression, it saves a set of *coefficients* indicating the "weight" of each word in terms of predicting the outcome variable. Suppose you named your classifier `lr`. You can access these coefficients at `lr.coef_[0]`. (The `[0]` is there because, for a binary classifier, `lr.coef_` is a 2-D array with a single row; the intercept of the model is stored separately in `lr.intercept_`.) This means that `lr.coef_[0][0]` is the weight of the 0th feature, `lr.coef_[0][1]` is the weight of the 1st feature, and so on.

You can therefore *iterate* through `vec.vocabulary_.items()`, and for each word (key) get its index (value) and then find the coefficient weight of that word in the model `lr.coef_[0][index]`.
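The shapes involved can be checked on a tiny example. In this sketch the four-document corpus and its labels are invented for illustration. For a binary problem, scikit-learn stores the word weights in row 0 of `coef_` and the intercept in `intercept_`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus: "apple" only appears in class 0, "banana" only in class 1.
texts = [
    "apple pie recipe",
    "apple tart recipe",
    "banana bread recipe",
    "banana smoothie recipe",
]
labels = [0, 0, 1, 1]

vec = CountVectorizer()
X_toy = vec.fit_transform(texts)

lr = LogisticRegression()
lr.fit(X_toy, labels)

print(lr.coef_.shape)       # one row of weights, one column per word
print(lr.intercept_.shape)  # the intercept lives here, not in coef_

# Pair every word with its weight, then sort from most negative to most positive.
pairs = [(word, lr.coef_[0][index]) for word, index in vec.vocabulary_.items()]
pairs.sort(key=lambda pair: pair[1])

print(pairs[0])   # strongest indicator of class 0
print(pairs[-1])  # strongest indicator of class 1
```

Negative weights push a prediction toward class 0 and positive weights push it toward class 1, so both ends of the sorted list are informative.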

In the code block (or blocks, if you want to make more than one to organize your code better) below:

1. Instantiate a *new* instance of a Logistic Regression classifier, `fit` that classifier on X and y. (1 point)
2. Use the notes above to make a list of tuples, where the first value in each tuple is a *word in the vocabulary* and the second value is the *coefficient weight assigned to that word in the trained Logistic Regression classifier*. Sort that list of tuples by the second value (the weight) ([Here's how you can do that](https://stackoverflow.com/questions/10695139/sort-a-list-of-tuples-by-2nd-item-integer-value)). (1 point)
3. Print the 10 words with the highest weights and the 10 words with the lowest weights. In a few sentences discuss whether these words help you understand why the model performed well/poorly. (1 point).

In [12]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3057)
#Instantiate a new Logistic Regression classifier and fit it on the training split of X and y
lr = LogisticRegression()
lr.fit(X_train, y_train)
weight=[]
#get words along with their weights
for word,index in tv.vocabulary_.items():
    weight.append((word,lr.coef_[0][index]))
#sort words on the basis of their weights, from most negative to most positive
sortedweights=sorted(weight, key=lambda pair: pair[1])
#print the 10 words with the lowest weights
print("the 10 words with the lowest weights",sortedweights[0:10],"\n")
#print the 10 words with the highest weights
print("10 words with the highest weights",sortedweights[-10:])

the 10 words with the lowest weights [('ms', -3.0556917730218784), ('ct', -2.4251110969922958), ('kd', -2.2529626546694348), ('vp', -2.1753336848987805), ('lj', -1.8893903382038322), ('onboard', -1.8725958264992673), ('ac', -1.8230466163576804), ('jt', -1.5340248554216607), ('mk', -1.4181223008180242), ('nc', -1.3145787759875553)] 

10 words with the highest weights [('dp', 1.8830109665928776), ('kf', 2.0170989026563229), ('y6hg6uklar', 2.2503818951031502), ('kl', 2.2736288919025731), ('em', 2.2762496643445829), ('sv', 2.3131771130156382), ('eb', 2.4362855978806643), ('cw', 2.5773109385964945), ('md', 2.6712757552519282), ('ad', 3.8400422563389491)]


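The sign of a weight shows which account a word points to: the words with the highest (most positive) weights are the strongest indicators of class 1 (united), and the words with the lowest (most negative) weights are the strongest indicators of class 0 (SouthwestAir). Words with weights close to zero are the least informative. Most of the words in both lists are two-letter tokens such as 'em' and 'cw', which look like the agent initials used to sign replies (for example ^EM and ^CW in the united tweets shown in Task 2). Each account appears to have its own set of initials, which likely explains why the model did well.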