# Sentiment Analysis

## Group Names and Roles

- Partner 1 (Role)
- Partner 2 (Role)
- Partner 3 (Role)

## Introduction

In this problem, you will study the sentiment in a data set of tweets collected during the COVID-19 pandemic. To load the data set, run this block: 

In [None]:
import pandas as pd

def grab_tweets():
    """
    The user supplied these data already split into training and test sets. 
    This function downloads and combines them, returning a single data frame.
    No arguments. 
    """

    url1 = "https://raw.githubusercontent.com/PhilChodrow/PIC16A/master/datasets/Corona_NLP_train.csv"
    url2 = "https://raw.githubusercontent.com/PhilChodrow/PIC16A/master/datasets/Corona_NLP_test.csv"
    
    df1 = pd.read_csv(url1, encoding='iso-8859-1')
    df2 = pd.read_csv(url2, encoding='iso-8859-1')
    
    return pd.concat((df1, df2), axis = 0, ignore_index = True)

df = grab_tweets()

### §1. Take a Look

Briefly inspect the data `df`. The `OriginalTweet` column contains the text of the tweet, and the `Sentiment` column contains a text label for the tweet's sentiment, as assigned by a crowdsourced worker. 

Create a `pandas` summary table showing how many tweets there are for each sentiment. 

**Hint**: `groupby().size()`. 
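
To illustrate the hint, here is a minimal sketch on a made-up data frame. The name `toy` and its four rows are assumptions for illustration only; the real `df` comes from `grab_tweets()`.

```python
import pandas as pd

# Toy stand-in for df (hypothetical rows, not taken from the real data).
toy = pd.DataFrame({
    "OriginalTweet": ["stay safe", "long lines today", "thank you nurses", "prices are up"],
    "Sentiment": ["Positive", "Negative", "Positive", "Neutral"],
})

# One entry per sentiment label, holding the number of tweets with that label.
counts = toy.groupby("Sentiment").size()
print(counts)
```

The result is a `pandas` Series indexed by sentiment label, so `counts["Positive"]` looks up a single count.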

### §2. Create the Term-Document Matrix

Now, use the `CountVectorizer` from `sklearn.feature_extraction.text` to construct a term-document matrix. You should add the columns of this matrix to the original `df`, resulting in a single data frame that contains both the word counts and the sentiments. 

I constructed my `CountVectorizer` with these settings:

```CountVectorizer(max_df = 0.2, min_df = 50, stop_words = 'english')```

However, you are free to experiment with different ones if you'd like to. You may find it useful to consult code from the [lecture on the term-document matrix](https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/NLP/NLP_1.ipynb) or the [lecture on sentiment analysis](https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/NLP/NLP_3.ipynb) for this part.  

### §3. Train-Test Split

Split the data into training and test sets. Use 50% of the data for training and 50% for testing. 
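
A 50/50 split can be sketched with `train_test_split` from `sklearn.model_selection`. The frame `toy` is a made-up stand-in, and `random_state = 0` is an arbitrary choice that only makes the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the combined data frame (hypothetical values).
toy = pd.DataFrame({
    "x": range(10),
    "Sentiment": ["Positive", "Negative"] * 5,
})

# test_size = 0.5 holds out half of the rows for testing.
train, test = train_test_split(toy, test_size=0.5, random_state=0)
print(len(train), len(test))
```

Rows are shuffled before splitting, so `train` and `test` keep their original index labels in a scrambled order.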

### §4. Prepare Predictor and Target Variables

So far, we've split our data into training and test sets, but we also need to separate out the predictor and target variables. We should also make sure that both sets of variables are encoded as numeric columns that a machine learning algorithm can understand. 

Write a function called `prepare_variables` to perform these tasks. It should have two return values: 

- `X`, the predictor columns. I suggest dropping all columns except the ones that you constructed as part of the term-document matrix. You can do this using `df.drop(list_of_columns, axis = 1)`. 
- `y`, the target column. There are multiple valid ways to construct `y` from the original `Sentiment` column of the data. For this activity, `y[i]` should be equal to `1` if `df["Sentiment"][i]` contains the word `"Positive"` and `0` otherwise. 
    - ***Hint***: `df["column"].str.contains("word")`

After you've defined your function, use it to prepare the training and test sets:  

```
X_train, y_train = prepare_variables(train) 
X_test,  y_test  = prepare_variables(test)
```

Check these sets to ensure that your function does what you'd expect. 
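
To show how the two hints fit together, here is a rough sketch on a made-up frame. The function name `prepare_variables_sketch`, its second argument, and the toy columns are all assumptions; your own `prepare_variables` should take a single data frame, and the columns to drop depend on what `df.columns` shows for the real data.

```python
import pandas as pd

# Toy frame: two non-word columns plus two word-count columns (hypothetical).
toy = pd.DataFrame({
    "OriginalTweet": ["a", "b", "c"],
    "Sentiment": ["Extremely Positive", "Negative", "Positive"],
    "store": [1, 0, 2],
    "grocery": [0, 1, 1],
})

def prepare_variables_sketch(df, non_word_columns):
    """Return (X, y): word-count predictors and a 0/1 target."""
    # Keep only the term-document columns as predictors.
    X = df.drop(non_word_columns, axis=1)
    # True wherever the label contains "Positive", converted to 1/0.
    y = df["Sentiment"].str.contains("Positive").astype(int)
    return X, y

X, y = prepare_variables_sketch(toy, ["OriginalTweet", "Sentiment"])
print(X.columns.tolist(), y.tolist())
```

Because `str.contains` matches substrings, any label that includes the word `"Positive"` is encoded as `1`.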

### §5. Fit the Model

Fit a logistic regression model to the training data. It is possible to perform cross-validation to set model complexity parameters, but we won't worry about that today. 

You may receive a warning indicating that "TOTAL NO. of ITERATIONS REACHED LIMIT." This means that the solver hit its iteration cap before fully converging. It looks a little bit intimidating, but in this case it's not really harmful. It's fine to ignore this warning, or you can raise the cap by setting the `max_iter` parameter to a higher value, such as `500` or `1000`, when you construct the `LogisticRegression()` model. Doing so may increase the amount of time required to fit your model. 

Then, evaluate the accuracy of your model on the training and testing data. You may observe a small amount of overfitting, indicated by the testing score being lower than the training score. Provided that both are over 80%, you're doing fine. 
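
The fit-then-score pattern can be sketched on synthetic data. Everything about the data here is an assumption for illustration: the two made-up word columns `good` and `bad`, the rule that generates the labels, and the half-and-half split by position.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic word counts (hypothetical): 200 "tweets", two words.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.integers(0, 3, size=(200, 2)), columns=["good", "bad"])
# Made-up labeling rule: positive when "good" appears more often than "bad".
y_toy = (X_toy["good"] > X_toy["bad"]).astype(int)

# First half for training, second half for testing.
X_tr, y_tr = X_toy.iloc[:100], y_toy.iloc[:100]
X_te, y_te = X_toy.iloc[100:], y_toy.iloc[100:]

# max_iter raised above the default to give the solver more iterations.
LR = LogisticRegression(max_iter=1000)
LR.fit(X_tr, y_tr)

# score() returns the fraction of correct predictions (accuracy).
train_acc = LR.score(X_tr, y_tr)
test_acc = LR.score(X_te, y_te)
print(train_acc, test_acc)
```

With the real data, the same two `score()` calls on `(X_train, y_train)` and `(X_test, y_test)` give the numbers to compare against the 80% guideline.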

### §6. Positive and Negative Words

The *coefficients* that logistic regression assigns to each word are stored in `LR.coef_[0]`, where `LR` is your fitted model. Show lists of the most positive and most negative words. Discuss your findings -- do these words look as though they would reasonably convey positive or negative sentiment?

***Hint***: *There are several good ways to do this. I found that the easiest way was to create a new `DataFrame` containing `LR.coef_[0]` in one column and `X_train.columns` (the list of words) in the other column. I then used `df.sort_values()` to sort on the column containing coefficients. The [lecture on sentiment analysis](https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/NLP/NLP_3.ipynb) illustrates this method.*
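
The method in the hint can be sketched on a tiny made-up example. The words `great` and `awful`, their counts, and the names `coef_df`, `most_negative`, and `most_positive` are assumptions for illustration; the lecture notebook may organize this differently.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy word counts (hypothetical): "great" appears in positive tweets,
# "awful" in negative ones.
X_toy = pd.DataFrame({
    "great": [2, 1, 0, 0, 1, 0],
    "awful": [0, 0, 1, 2, 0, 1],
})
y_toy = [1, 1, 0, 0, 1, 0]

LR = LogisticRegression().fit(X_toy, y_toy)

# One row per word, paired with its fitted coefficient.
coef_df = pd.DataFrame({"word": X_toy.columns, "coef": LR.coef_[0]})
coef_df = coef_df.sort_values("coef")  # ascending: most negative first

most_negative = coef_df.head(1)
most_positive = coef_df.tail(1)
print(most_negative, most_positive, sep="\n")
```

On the real data, `head(10)` and `tail(10)` of the sorted frame give longer lists to discuss.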

In [None]:
# negative words



In [None]:
# positive words



### §7. Inspect Errors

Finally, we should take a look at a few examples in which the model made a mistake on the test data. Inspect three tweets in which the model's prediction and the actual value differ. Comment on why you think the error may have occurred. 

**Don't overthink it!** It's sufficient to simply identify three rows in which the prediction disagrees with the true label. You shouldn't need more than 3-4 lines. 

***Note***: *One valid response is to disagree with the sentiment label assigned to the tweet -- some of them are a little odd.*
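
Finding the disagreeing rows can be sketched as follows. The tweets, the true labels, and especially `y_pred` are made up; on the real data, `y_pred` would come from calling `LR.predict(X_test)`.

```python
import numpy as np
import pandas as pd

# Toy test set (hypothetical tweets and labels).
test_toy = pd.DataFrame({
    "OriginalTweet": [
        "great news today",
        "awful lines at the store",
        "what a great mess",
        "stay home",
    ],
})
y_true = pd.Series([1, 0, 0, 1], index=test_toy.index)

# Hypothetical stand-in for the array returned by LR.predict(X_test).
y_pred = np.array([1, 0, 1, 0])

# Boolean Series, True where prediction and label disagree.
wrong = y_true != y_pred

# Rows of the test set where the model was wrong.
errors = test_toy[wrong]
print(errors["OriginalTweet"].tolist())
```

Comparing the Series to the array keeps the index of `y_true`, so the boolean mask lines up with the rows of the test frame even after a shuffled split.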