# Partial Exam: Tweets Classification - Part III (25%)

**Main Topics:** Tweet Classification and N-Grams analysis

**Deadline:** April 15th 11:59 PM (Finnish Time)

**Author:** Andrés Felipe Zapata Palacio  

**Tasks:**  
* #1 (40 Points) Model Evaluation
* #2 (30 Points) Explore wrong Predictions
* #3 (30 Points) Filter Bigrams

🙋 If you have any questions about this assignment, you can submit them through the following form: https://forms.gle/6aw7hVH7fKhgRqGLA

## ⚠️ Important Information about the Submission Process ⚠️

As you know, I have some restrictions as a teacher at SAMK. Because I don't have an institutional account, I cannot access the Moodle platform at all. For this reason, I will use Dropbox to receive your exams. You don't need to create a Dropbox account to follow this procedure:

1. Open the following link: https://www.dropbox.com/request/qwN0AQmZnnFap2piUYwK

2. Click on the button "Add Files", and then "Files from your Computer"

3. Upload your .ipynb file

4. If you are not logged in, the platform will ask for your name and your e-mail. Please enter your full name and your institutional e-mail.


If you have any trouble uploading the files, you MUST contact me. Don't wait until the deadline has passed.

**Important Notes** ⚠️

* It's EXPLICITLY FORBIDDEN to use ChatGPT or any other Software to generate the answers or the analyses of the exam.

* You are allowed to modify only the parts of the code that are delimited by comments. These sections usually have a comment that says: "Write your code here".

* For the open questions, you have to explain your opinion or decision in detail. Your answer must demonstrate that you have mastered the topics of the course.

* Upload the notebook to Dropbox in ipynb format. You must export the notebook without removing the outputs from the cells.

* Verify that your notebook runs without errors before submitting it.

**Name of the Student (Penalty of 5 points)** ⚠️

One requirement for this assignment is to change the name of the file before uploading it to the system, e.g. NLP_10_PartialExam_Andres_Zapata.ipynb. Additionally, you have to write your name in the space below:

```
Dawid Nalepa
```

If you don't do these two steps, your final score will be reduced by 5 points.



# Partial Exam

In [None]:
#@title Auxiliary Functions and Dependencies ⚠️
#@markdown ⚡ Run this cell to load the functions required for the exam, as well as all the dependencies and external libraries used in the process.



import nltk

# Tweet Sample Dataset
nltk.download('twitter_samples')

# POS Tagging
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')

# Stop Words
nltk.download('stopwords')

# Numpy
import numpy as np

# Regular Expressions
import re

# DataFrames
import pandas as pd

# Math
import math

# Interactive Widgets
import ipywidgets as widgets
from ipywidgets import interact, interact_manual, FloatSlider, Layout

#Model Selection and Validation
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report,accuracy_score

def printTokensInVocabs(tokens):
  # Relies on the vectorizers returned by buildVectorizers() being available as globals
  counters = {'CountVectorizer': tfCounter,'TF Normalized':tfNormalizedCounter,'TfIdfVectorizer':tfIdfCounter}
  for counterName in counters:
    counter = counters[counterName]
    newTokens = []
    for token in tokens:                    # Use the argument instead of the global tokenizedTweet
      if token in counter.vocabulary_:      # Check the vocabulary of the current vectorizer
        newTokens.append(token)
    print(f'Tokens in the Vocabulary of {counterName}: \t{newTokens}')

[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Unzipping corpora/twitter_samples.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Unzipping taggers/universal_tagset.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


## ☑️ Pre-Task 1.1: Load Dataset (0 Points)

Load the Positive and Negative Tweets, and generate the labels for each sample.

In [None]:
from nltk.corpus import twitter_samples

positiveTweets = twitter_samples.strings('positive_tweets.json')
negativeTweets = twitter_samples.strings('negative_tweets.json')

allTweets = []
allTweets.extend(positiveTweets)
allTweets.extend(negativeTweets)

nPositive = len(positiveTweets)
nNegative = len(negativeTweets)

positiveLabels = np.ones(nPositive)
negativeLabels = np.zeros(nNegative)

allLabels = []
allLabels.extend(positiveLabels)
allLabels.extend(negativeLabels)
allLabels = np.array(allLabels)

## ☑️ Pre-Task 1.2: Clean and Process Data (0 Points)

Define the different functions that will be used to clean and process the tweets, before transforming them into a numerical representation.

* **preprocessTweet()** receives a tweet (string) and returns a cleaned string (removes URLs, e-mail addresses and mentions, and collapses repeated whitespace into a single space).

* **tokenizeTweet()** receives a tweet (string) and returns a list of tokens. It uses the TweetTokenizer class provided by the NLTK library, splits compound words written in camelCase (via splitTokens()) and cleans the resulting tokens (via cleanTokens()).

* **cleanTokens()** receives a list of tokens, converts them to lowercase, removes the tokens that are not needed (single punctuation characters and tokens that start with a digit) and strips the leading `#` from hashtags.

In [None]:
def preprocessTweet(tweet):
  # Raw strings avoid invalid-escape warnings for \S, \w and \s
  tweet = re.sub(r'http[s]?://[\S]+', ' ', tweet)              # Remove URLs
  tweet = re.sub(r'[\w]+([._-]\w+)*@\w+([.]\w+)*', ' ', tweet) # Remove e-mails
  tweet = re.sub(r'@\S+','', tweet)                            # Remove mentions
  tweet = re.sub(r'\s+', ' ', tweet)                           # Collapse repeated whitespace into a single space
  return tweet

In [None]:
def cleanTokens(tokens):
  newTokens = []
  for token in tokens:
    token = token.lower()
    if re.match(r'^[_*#!$@<=^`>%&\'\"/()\[\]\-+,.:;?]$', token): # Remove tokens that are a single punctuation character
      continue
    if re.match(r'\d+', token): # Remove numbers (any token that starts with a digit)
      continue
    if re.match(r'#[\w\d]+', token): # Strip the leading '#' from hashtags
      token = token[1:]
    newTokens.append(token)
  return newTokens

In [None]:
def splitTokens(tokens):
  splitPattern = r'(?<=[a-z])(?=[A-Z])'
  newTokens = []
  for token in tokens:
    pieces = re.split(splitPattern, token)
    newTokens.extend(pieces)
  return newTokens

In [None]:
from nltk.tokenize import TweetTokenizer

def tokenizeTweet(tweet):
  tokens = TweetTokenizer().tokenize(tweet)
  splittedTokens = splitTokens(tokens)
  cleanedTokens = cleanTokens(splittedTokens)
  return cleanedTokens

In [None]:
from nltk.corpus import stopwords

englishStopWords = stopwords.words('english')

## ☑️ Pre-Task 1.3: Generate Different Data Representation (0 Points)

Create three different numerical representations for the Tweets dataset:

1. Simple Term Frequency (Word Count) using CountVectorizer.
2. Normalized Term Frequency using TfidfVectorizer with IDF disabled.
3. TF-IDF Representation using TfidfVectorizer.
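
To make the difference between the three representations concrete, the sketch below computes them by hand for a tiny made-up corpus. It is a pure-Python illustration and not the scikit-learn implementation: it assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which are the documented defaults of TfidfVectorizer. The corpus and all helper names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus (already tokenized); not taken from the tweets dataset.
docs = [
    ["happy", "happy", "day"],
    ["sad", "day"],
    ["happy", "birthday"],
]
vocab = sorted({token for doc in docs for token in doc})

def termFrequency(doc):
    """Raw count of every vocabulary term in one document (word count)."""
    counts = Counter(doc)
    return [counts[term] for term in vocab]

def l2Normalize(vector):
    """Scale a vector to unit Euclidean length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector

def smoothIdf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = sum(1 for doc in docs if term in doc)
    return math.log((1 + n) / (1 + df)) + 1

idf = [smoothIdf(term) for term in vocab]

for doc in docs:
    tf = termFrequency(doc)
    tfNormalized = l2Normalize(tf)
    tfIdf = l2Normalize([count * weight for count, weight in zip(tf, idf)])
    print(doc)
    print("  TF            :", tf)
    print("  TF normalized :", [round(v, 3) for v in tfNormalized])
    print("  TF-IDF        :", [round(v, 3) for v in tfIdf])
```

In the second document, "sad" and "day" have the same count, but TF-IDF gives "sad" the larger weight because it appears in fewer documents.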

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [None]:
def buildVectorizers(max_features):
  # Term Frequency
  tfCounter = CountVectorizer(
    preprocessor = preprocessTweet,
    stop_words = englishStopWords,
    tokenizer = tokenizeTweet,
    max_features = max_features,
  )
  tfRepresentation = tfCounter.fit_transform(allTweets)

  # TF Normalized
  tfNormalizedCounter = TfidfVectorizer(
    use_idf = False, norm = 'l2', # This removes the IDF part
    preprocessor = preprocessTweet,
    stop_words = englishStopWords,
    tokenizer = tokenizeTweet,
    max_features = max_features,
  )
  tfNormalizedRepresentation = tfNormalizedCounter.fit_transform(allTweets)

  # TF-IDF Normalized
  tfIdfCounter = TfidfVectorizer(
    preprocessor = preprocessTweet,
    stop_words = englishStopWords,
    tokenizer = tokenizeTweet,
    max_features = max_features,
  )

  tfIdfRepresentation = tfIdfCounter.fit_transform(allTweets)

  return tfRepresentation, tfNormalizedRepresentation, tfIdfRepresentation, tfCounter, tfNormalizedCounter, tfIdfCounter

## ☑️ Pre-Task 1.4: Train Function (0 Points)

In [None]:
def trainAndEvaluate(tweets, labels):
  # Split Dataset in Train and Test
  X_train, X_test, y_train, y_test = train_test_split(tweets, labels, shuffle=True, random_state=10)

  # Build and Train the Model
  model = LogisticRegressionCV(max_iter=2000)
  model.fit(X_train,y_train)

  # Calculate Accuracy
  trainAcc = model.score(X_train, y_train)
  print(f'Train Accuracy: {trainAcc*100:.2f}%')
  testAcc = model.score(X_test, y_test)
  print(f'Test Accuracy: {testAcc*100:.2f}%\n')

  # Calculate other metrics (note: computed over the whole dataset, train and test together)
  tn, fp, fn, tp = confusion_matrix(labels, model.predict(tweets)).ravel()
  precision = tp / (tp + fp)
  sensitivity = tp / (tp + fn)
  specificity = tn / (tn + fp)
  print(f'Precision: {precision*100:.2f}%')
  print(f'Sensitivity: {sensitivity*100:.2f}%')
  print(f'Specificity: {specificity*100:.2f}%')

  # Return Variables
  results = {
      'model': model,
      'testAcc':testAcc,
      'trainAcc':trainAcc,
      'precision':precision,
      'sensitivity':sensitivity,
      'specificity':specificity
  }
  return results

## ☑️ Task 1: Model Evaluation (40 points)

There are three different Logistic Regression models, each trained using a different numerical representation of the tweets.

**🎯 Tasks:**

* Explore different values for max_features. This variable can take values between 1 and 12,010 (the number of distinct tokens in the dataset after cleaning) and defines the size of the vocabulary used during training. Find a value that gives you the highest test accuracy while minimizing overfitting. What happens to the test and train accuracy when the value is very small (for example 5 or 20)? What happens to them when the value is very large (for example 10,000 or 12,000)?

* Compare the performance of the three models and determine which of the three numerical representations is the best one for classifying tweets as positive or negative using Logistic Regression.

* Take into account all the different metrics given (accuracy, precision, sensitivity, specificity), and compare the performance over Training data and Test Data.
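
As a reminder of how these metrics relate to the confusion matrix, the sketch below derives them in plain Python with the same formulas that `trainAndEvaluate` uses. The labels and predictions are hypothetical (1 = positive, 0 = negative) and the function name is only illustrative.

```python
def binaryMetrics(yTrue, yPred):
    """Accuracy, precision, sensitivity and specificity for 0/1 labels.

    Assumes both classes occur and at least one positive prediction is made,
    so that no denominator is zero.
    """
    pairs = list(zip(yTrue, yPred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # positives found
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # negatives found
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # negatives predicted as positive
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # positives predicted as negative
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Hypothetical labels and predictions, for illustration only.
yTrue = [1, 1, 1, 1, 0, 0, 0, 0]
yPred = [1, 1, 1, 0, 0, 0, 1, 1]

for name, value in binaryMetrics(yTrue, yPred).items():
    print(f"{name}: {value:.3f}")
```

Precision looks at the predicted positives, sensitivity at the real positives and specificity at the real negatives, so a model can score well on one of them and poorly on another.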

In [None]:
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import confusion_matrix

In [None]:
#####   Write your code here   #####

# Max Features must be between 1 and 12010
max_features = 1000

####################################

tfRepresentation, tfNormalizedRepresentation, tfIdfRepresentation, tfCounter, tfNormalizedCounter, tfIdfCounter = buildVectorizers(max_features)
print('\n')
print('Model (1) Absolute Term Frequency (Word Count)\n')
tfResults = trainAndEvaluate(tfRepresentation, allLabels)
print('\n')
print('Model (2) Normalized Term Frequency (TF-IDF with IDF disabled)\n')
tfNormResults = trainAndEvaluate(tfNormalizedRepresentation, allLabels)
print('\n')
print('Model (3) TF-IDF Representation\n')
tfIdfResults = trainAndEvaluate(tfIdfRepresentation, allLabels)

pass





Model (1) Absolute Term Frequency (Word Count)

Train Accuracy: 99.97%
Test Accuracy: 99.52%

Precision: 99.74%
Sensitivity: 99.98%
Specificity: 99.74%


Model (2) Normalized Term Frequency (TF-IDF with IDF disabled)

Train Accuracy: 99.97%
Test Accuracy: 99.64%

Precision: 99.80%
Sensitivity: 99.98%
Specificity: 99.80%


Model (3) TF-IDF Representation

Train Accuracy: 99.91%
Test Accuracy: 99.84%

Precision: 99.84%
Sensitivity: 99.94%
Specificity: 99.84%


**🧩Hints:**
* Compare the test and the train accuracy to identify overfitting or underfitting.

* Test accuracy reflects how well a model generalizes what it learned from the training set to data that is completely new.
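
One simple way to apply the first hint, sketched here under the assumption that the gap between train and test accuracy is a reasonable first indicator: a large positive gap suggests overfitting, while low accuracy on both sets suggests underfitting. The accuracies are copied from the max_features = 1000 run shown above; the function name is hypothetical.

```python
# Accuracies (in %) copied from the max_features = 1000 output above.
results = {
    "TF": {"train": 99.97, "test": 99.52},
    "TF normalized": {"train": 99.97, "test": 99.64},
    "TF-IDF": {"train": 99.91, "test": 99.84},
}

def generalizationGap(scores):
    """Train accuracy minus test accuracy, in percentage points."""
    return scores["train"] - scores["test"]

# List the models from the smallest to the largest gap.
for name, scores in sorted(results.items(), key=lambda item: generalizationGap(item[1])):
    gap = generalizationGap(scores)
    print(f"{name}: test = {scores['test']:.2f}%, gap = {gap:.2f} points")
```

The same comparison can be repeated for every value of max_features you try.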

⁉️ **Question (40 Points)** 🧐

What did you observe with the different values of max_features? Did you observe overfitting or underfitting with any group of values?

```
When I initially set max_features to values between 5 and 20, I noticed
that the models were underfitting on both the train and the test data.
Whenever max_features was set to anything above 10, the train accuracy
became higher than the test accuracy.

When setting max_features to values between 10,000 and 12,000, I came to
the conclusion that the models were overfitting.
The train accuracy slowly approached 100%, while the test accuracy
stayed around 99.5%, increasing by about 0.1% with each model.
```

Which value for max_features gives you the best training results?

```
Setting max_features to 12,000 gave the best training results.
With max_features set to 10,000, only the first two models reached 100% train accuracy,
whereas with max_features set to 12,000 all three models reached 100%.
```

Which of the three models is the best one (TF, TF-normalized, TF-IDF)? You must give arguments based on the different metrics (precision, sensitivity, specificity) and you must compare also the Test and Train accuracy.

```
TF-IDF is the best of the three models presented.

In the run shown above it has the highest test accuracy (99.84%) and the
smallest gap between train and test accuracy (99.91% vs. 99.84%), so it
generalizes best, even though its train accuracy is slightly lower than that
of the other two models (99.97%).

It also has the highest precision (99.84%), which means that tweets predicted
as positive are rarely negative.

Its sensitivity (99.94%) is marginally lower than that of the other two models
(99.98%), so it misses slightly more positive tweets, but the difference is
negligible.

As for specificity, it is the highest of the three (99.84%), so it has the
lowest chance of classifying a negative tweet as positive.
```

## ☑️ Task 2: Explore Wrong Predictions (30 Points)

**🎯Task:** The following code cells list the tweets that were misclassified by each of the three models. Examine them in detail and try to determine possible reasons for these misclassifications. Pay attention to the preprocessing, cleaning and processing stages. It's possible that some tweets contain details that make them difficult to classify. You must also check whether the words are present in the vocabulary of each Vectorizer.

In [None]:
def trainAndSeeErrors(model, tweets, labels):
  matrix = tweets.toarray()
  for i in range(len(labels)):
    pred = model.predict([matrix[i]])
    real = labels[i]
    if pred != real:
      print('---------')
      if real == 1: label = '(+)'
      else: label = '(-)'
      print(f'{label} -> {allTweets[i]}')

In [None]:
trainAndSeeErrors(tfResults['model'], tfRepresentation, allLabels)

---------
(+) -> @ellekagaoan @chinmarquez Catch up once in a while :( &gt;:D&lt; @aditriphosphate @ErinMonzon
---------
(-) -> @sainsburys guys a really unlucky one. The driver and I briefly checked eggs but my other half spotted this : ( http://t.co/WpCqJHhBVk
---------
(-) -> all time looww(:(
---------
(-) -> stu is mean, i just wanna sleep : (
---------
(-) -> @PSYCRM you still haven't ! ! : (
---------
(-) -> @c_tuilagi Anytime Lil Nigga!! (: (:
---------
(-) -> 20 losing streak... sad (:-(
---------
(-) -> i pOPPED CONFETTI THOUGH ! ! : ( https://t.co/Y79gPDxTIE
---------
(-) -> Zehr khany ka time is coming soon.....: (
---------
(-) -> Annnd, now not going to Winchester {:-(
---------
(-) -> pats jay : (
---------
(-) -> my beloved grandmother : ( https://t.co/wt4oXq5xCf
---------
(-) -> @CHEDA_KHAN Thats life. I get calls from people I havent seen in 20 years and its always favours : (
---------
(-) -> Sr. Financial Analyst - Expedia, Inc.: (#Bellevue, WA) http://t.co/ktknMhvw

In [None]:
trainAndSeeErrors(tfNormResults['model'], tfNormalizedRepresentation, allLabels)

---------
(+) -> @ellekagaoan @chinmarquez Catch up once in a while :( &gt;:D&lt; @aditriphosphate @ErinMonzon
---------
(-) -> all time looww(:(
---------
(-) -> stu is mean, i just wanna sleep : (
---------
(-) -> @c_tuilagi Anytime Lil Nigga!! (: (:
---------
(-) -> i pOPPED CONFETTI THOUGH ! ! : ( https://t.co/Y79gPDxTIE
---------
(-) -> Zehr khany ka time is coming soon.....: (
---------
(-) -> Annnd, now not going to Winchester {:-(
---------
(-) -> pats jay : (
---------
(-) -> my beloved grandmother : ( https://t.co/wt4oXq5xCf
---------
(-) -> @CHEDA_KHAN Thats life. I get calls from people I havent seen in 20 years and its always favours : (
---------
(-) -> Sr. Financial Analyst - Expedia, Inc.: (#Bellevue, WA) http://t.co/ktknMhvwCI #Finance #ExpediaJobs #Job #Jobs #Hiring


In [None]:
trainAndSeeErrors(tfIdfResults['model'], tfIdfRepresentation, allLabels)

---------
(+) -> Remember that one time I didn't go to flume/kaytranada/alunageorge even though I had tickets? I still want to kms. : ) : )
---------
(+) -> FNAF 4 dropped...looks like no sleep 4 me : )))))
---------
(+) -> @ellekagaoan @chinmarquez Catch up once in a while :( &gt;:D&lt; @aditriphosphate @ErinMonzon
---------
(-) -> @Israelgirly They sure do, esp now when ppl are talking crap about Millie!! &gt;:( I'll go straight to that FB page:)
---------
(-) -> @wtfxmbs AMBS please it's harry's jeans :)):):):(
---------
(-) -> i pOPPED CONFETTI THOUGH ! ! : ( https://t.co/Y79gPDxTIE
---------
(-) -> @Mickb1980 @CalderClarion @ev2cycling Looks good pal. Glad I paid £111 for my jersey and gilet! : (
---------
(-) -> Annnd, now not going to Winchester {:-(
---------
(-) -> pats jay : (
---------
(-) -> my beloved grandmother : ( https://t.co/wt4oXq5xCf
---------
(-) -> Sr. Financial Analyst - Expedia, Inc.: (#Bellevue, WA) http://t.co/ktknMhvwCI #Finance #ExpediaJobs #Job #Jobs #Hirin

In [None]:
#####################       WRITE YOUR CODE HERE       #####################
myTestTweet = '''
Sr. Financial Analyst - Expedia, Inc.: (#Bellevue, WA) http://t.co/ktknMhvwCI #Finance #ExpediaJobs #Job #Jobs #Hiring
'''
############################################################################
myTestTweet = myTestTweet.replace('\n',' ')
preprocessedTweet = preprocessTweet(myTestTweet)
tokenizedTweet = tokenizeTweet(preprocessedTweet)

print(f'Original Tweet:\t\t{myTestTweet}')
print(f'Pre-processed Tweet:\t{preprocessedTweet}')
print(f'Tokenized Tweet:\t{tokenizedTweet}')
printTokensInVocabs(tokenizedTweet)

Original Tweet:		 Sr. Financial Analyst - Expedia, Inc.: (#Bellevue, WA) http://t.co/ktknMhvwCI #Finance #ExpediaJobs #Job #Jobs #Hiring 
Pre-processed Tweet:	 Sr. Financial Analyst - Expedia, Inc.: (#Bellevue, WA) #Finance #ExpediaJobs #Job #Jobs #Hiring 
Tokenized Tweet:	['sr', 'financial', 'analyst', 'expedia', 'inc', 'bellevue', 'wa', 'finance', 'expedia', 'jobs', 'job', 'jobs', 'hiring']
Tokens in the Vocabulary of CountVectorizer: 	['job']
Tokens in the Vocabulary of TF Normalized: 	['job']
Tokens in the Vocabulary of TfIdfVectorizer: 	['job']


⁉️ **Question (30 Points)** 🧐

What did you observe exploring the tweets that were misclassified? Is there a problem in the preprocessing? Is there a problem in the tokenization? Is it a problem of the model? Are there samples that have a wrong label?

```
When inspecting each misclassified tweet, I came to the realization
that they share a common feature: a sad emoticon.
I believe they are misclassified for that specific reason, as the models
see the emoticon and classify it as negative due to the nature of the emoticon.
In some cases the happy emoticon is written in reverse order.
For example, ":)" appears in some tweets as "(:".
This could also lead to misclassification.

The pre-processing seems to be working well: it removes mentions from tweets,
as well as hyperlinks. In my opinion, pre-processing could handle emoticons significantly
better, as they cause errors during tokenization.

The tokenization seems to ignore emoticons made of punctuation signs, which may be
caused by the whitespace between the two signs. While tokenizing the tweets, the cleaning
process also appears to split some words incorrectly. For example, in one of the tweets,
the word "pOPPED" ends up in the token list as ["p","opped"].
The word should have been kept together, but it is split by the splitting
function, which splits a word wherever a lowercase letter is followed by an
uppercase letter.

I do not believe it is an issue with the model.

In my opinion, there are samples that have been labeled incorrectly.
In most cases these are positive tweets that carry a negative label.
```
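
The two tokenization effects described in the answer above can be reproduced with the regular expressions from `cleanTokens()` and `splitTokens()`. The sketch below is a simplified stand-in that does not call NLTK: it assumes that the tokenizer emits `:` and `(` as two separate tokens when the emoticon is written with a space, and the helper names are hypothetical.

```python
import re

# Patterns copied from cleanTokens() and splitTokens() in this notebook.
SINGLE_PUNCTUATION = re.compile(r'^[_*#!$@<=^`>%&\'\"/()\[\]\-+,.:;?]$')
CAMEL_CASE_BOUNDARY = re.compile(r'(?<=[a-z])(?=[A-Z])')

def dropSinglePunctuation(tokens):
    """Keep only the tokens that are not a single punctuation character."""
    return [token for token in tokens if not SINGLE_PUNCTUATION.match(token)]

def splitCamelCase(token):
    """Split at every lowercase-to-uppercase boundary and lowercase the pieces."""
    return [piece.lower() for piece in CAMEL_CASE_BOUNDARY.split(token)]

# Assumed tokenizer output for "sleep :(" and for "sleep : (".
print(dropSinglePunctuation(["sleep", ":("]))      # the emoticon survives as one token
print(dropSinglePunctuation(["sleep", ":", "("]))  # both halves are removed
print(splitCamelCase("pOPPED"))                    # an unintended split
print(splitCamelCase("ExpediaJobs"))               # the intended use of the split
```

Under that assumption, a tweet that ends in ": (" reaches the vectorizer without its strongest negative signal.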

What would you do to correct those misclassifications?

```
I believe that the correct course of action would be to review and debug the tokenization
process in order to avoid issues such as words being split incorrectly or emoticons being ignored.
```
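
One possible way to act on this, offered as a sketch and not as the expected solution, is to join emoticons that are written with a space before the tweet is tokenized. The pattern below is an assumption that only covers a few simple shapes (eyes, an optional `-` nose, whitespace, then parentheses), and `joinSpacedEmoticons` is a hypothetical helper that is not part of the notebook.

```python
import re

# Eyes (: ; =), optional nose (-), whitespace, then one or more parentheses.
SPACED_EMOTICON = re.compile(r'([:;=])(-?)\s+([()]+)')

def joinSpacedEmoticons(tweet):
    """Rewrite ': (' as ':(' and ': )))' as ':)))' so the emoticon stays together."""
    return SPACED_EMOTICON.sub(r'\1\2\3', tweet)

# Examples taken from the misclassified tweets listed above.
print(joinSpacedEmoticons('stu is mean, i just wanna sleep : ('))
print(joinSpacedEmoticons('FNAF 4 dropped...looks like no sleep 4 me : )))))'))

# Limitation: the pattern also fires on text that is not an emoticon.
print(joinSpacedEmoticons('Sr. Financial Analyst - Expedia, Inc.: (#Bellevue, WA)'))
```

The last example shows why such a rule needs to be checked against the data: the job advertisement contains ": (" by accident, and joining it would create an emoticon that the author never wrote.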

## ☑️ Task 3: Filter Bigrams (30 points)

**NOTE:** The answers for each task must be located in the text cell at the bottom.

**🎯 Task 3.1:** Explore the bigrams extracted from the positive tweets, and define which are the 10 most relevant bigrams of positive tweets. You must take into account different factors, such as bigram frequency, bigram PMI, how meaningful the bigrams are, and how important they are for determining whether a tweet is positive or not.

**🎯 Task 3.2:** Explain the process that you followed to get your list of the 10 most important positive bigrams. You have two options to get this list. The first option is to use the interactive cell, where you can filter the bigrams by setting a minimum PMI, a minimum bigram frequency and a list of words that you want to remove. The second option is to explore the entire DataFrame and the full list of bigrams, filtering it manually with Python code and defining your own criteria.

**🎯 Task 3.3:** Explore the bigrams extracted from the negative tweets, and define which are the 10 most relevant bigrams of negative tweets. You must take into account different factors, such as bigram frequency, bigram PMI, how meaningful the bigrams are, and how important they are for determining whether a tweet is negative or not.

**🎯 Task 3.4:** Explain the process that you followed to get your list of the 10 most important negative bigrams. You have two options to get this list. The first option is to use the interactive cell, where you can filter the bigrams by setting a minimum PMI, a minimum bigram frequency and a list of words that you want to remove. The second option is to explore the entire DataFrame and the full list of bigrams, filtering it manually with Python code and defining your own criteria.
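
PMI compares how often two words occur together with how often they would occur together if they were independent: PMI(w1, w2) = ln(P(w1, w2) / (P(w1) · P(w2))). The sketch below applies this definition to a tiny made-up token list, with the same probability estimates as `getBigramIndicators()` (bigram count over the number of bigrams, word count over the number of words) and the natural logarithm, as in `calculatePMI()`. It uses only the standard library instead of NLTK and pandas, and the tokens are hypothetical.

```python
import math
from collections import Counter

# Hypothetical token stream, for illustration only.
words = ["good", "luck", "have", "a", "good", "day", "good", "luck"]
bigrams = list(zip(words, words[1:]))  # consecutive pairs of tokens

wordFreq = Counter(words)
bigramFreq = Counter(bigrams)

def pmi(bigram):
    """Pointwise mutual information of a bigram, using the natural logarithm."""
    word1, word2 = bigram
    bigramProb = bigramFreq[bigram] / len(bigrams)
    word1Prob = wordFreq[word1] / len(words)
    word2Prob = wordFreq[word2] / len(words)
    return math.log(bigramProb / (word1Prob * word2Prob))

for bigram, freq in bigramFreq.most_common():
    print(bigram, freq, round(pmi(bigram), 3))
```

In this toy example, ("have", "a") occurs only once but gets a higher PMI than ("good", "luck"), which occurs twice. PMI rewards pairs of rare words, which is why it should be read together with the bigram frequency.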

In [None]:
#@title Auxiliary Functions ⚠️
#@markdown ⚡ Run this cell to load the functions that will allow you to filter the bigrams

def calculatePMI(data):
  word1Prob = data['word1Prob']
  word2Prob = data['word2Prob']
  bigramProb = data['bigramProb']
  pmi = math.log(bigramProb/(word1Prob*word2Prob))
  return pmi

def filterStopWords(data,stopWords):
  word1 = data['word1']
  word2 = data['word2']
  if word1 in stopWords or word1 in englishStopWords:
    return False
  if word2 in stopWords or word2 in englishStopWords:
    return False
  return True

def getBigramIndicators(tweets):
  tweets = ' '.join(tweets)
  tweets = preprocessTweet(tweets)
  words = tokenizeTweet(tweets)
  bigrams = nltk.bigrams(words)
  bigrams = list(bigrams)
  bigramFrequencies = nltk.FreqDist(bigrams)
  wordFrequencies = nltk.FreqDist(words)
  df = pd.DataFrame()
  df['bigram'] = list(set(bigrams))
  df['word1'] = df['bigram'].apply(lambda bigram: bigram[0])
  df['word2'] = df['bigram'].apply(lambda bigram: bigram[1])
  df['bigramFreq'] = df['bigram'].apply(lambda bigram: bigramFrequencies[bigram])
  df['word1Freq'] = df['word1'].apply(lambda word1: wordFrequencies[word1])
  df['word2Freq'] = df['word2'].apply(lambda word2: wordFrequencies[word2])
  df['bigramProb'] = df['bigramFreq'].apply(lambda freq: freq/len(bigrams))
  df['word1Prob'] = df['word1Freq'].apply(lambda wordFreq: wordFreq/len(words))
  df['word2Prob'] = df['word2Freq'].apply(lambda wordFreq: wordFreq/len(words))
  df['pmi'] = df.apply(lambda data: calculatePMI(data), axis=1)
  return df

def removeRepeatedTweets(tweets):
  beginnings = set()

  uniqueTweets = []

  for tweet in tweets:
    beginning = tweet[:10]
    if beginning not in beginnings:
      uniqueTweets.append(tweet)
      beginnings.add(beginning)

  return uniqueTweets

uniquePositive = removeRepeatedTweets(positiveTweets)
uniqueNegative = removeRepeatedTweets(negativeTweets)

In [None]:
def filterBigrams(dfBigrams, minPMI, minBigramFreq):
  df = dfBigrams
  df = df[df['bigramFreq'] >= minBigramFreq]
  df = df[df['pmi'] >= minPMI]
  return df

In [None]:
positiveBigrams = getBigramIndicators(positiveTweets)
negativeBigrams = getBigramIndicators(negativeTweets)
allBigrams = getBigramIndicators(allTweets)

In [None]:
#@title Analyze Bigrams from Positive Tweets (Option 1)
#@markdown ⚡ Run this cell to filter the different bigrams, specifying minimum values for PMI and Bigram Frequency.
#@markdown You can write multiple words to be filtered in the text box, separating them with commas.

# Interactive Controls
minPMI = FloatSlider(min=-2, max=11, step=0.5, value=-2,description='Min PMI',layout=Layout(width='500px'))
minBigramFreq = FloatSlider(min=1, max=150, step=1, value=1,description='Min Bigram Freq',layout=Layout(width='500px'))
WordsToRemove = ":),:-),u,:d,:p,yet,the,and,an,a,good,follow" #@param {type:"string"}

positiveBigrams = getBigramIndicators(uniquePositive)
stopWords = WordsToRemove.replace(' ','').split(',')

@interact
def filterPositive(minPMI=minPMI, minBigramFreq=minBigramFreq):
  df = filterBigrams(positiveBigrams, minPMI, minBigramFreq)
  return df[df.apply(lambda data: filterStopWords(data, stopWords), axis=1)]

#filterPositive(minPMI=minPMI, minBigramFreq=minBigramFreq)


If you want to filter the bigrams by yourself and explore them in more detail, you can use the following cell of code to do your own filters and calculations. ⚠️ DON'T MODIFY positiveBigrams DIRECTLY, MODIFY df instead.
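
As a sketch of what such manual filtering could look like, the example below applies a minimum bigram frequency, a minimum PMI and a stop-word check to a handful of rows, and sorts what is left by PMI. It uses plain dictionaries instead of the DataFrame so that it runs on its own. The rows are copied from the positive-bigram DataFrame output in this notebook, while the thresholds, the stop-word set and the function name are assumptions chosen for illustration.

```python
# A few rows copied from the positive-bigram DataFrame output in this notebook.
rows = [
    {"word1": "almost", "word2": "views", "bigramFreq": 1, "pmi": 6.504374},
    {"word1": "now", "word2": ":-)", "bigramFreq": 5, "pmi": 1.135465},
    {"word1": "good", "word2": "start", "bigramFreq": 1, "pmi": 1.668092},
    {"word1": "south", "word2": "korea", "bigramFreq": 1, "pmi": 8.961109},
    {"word1": "to", "word2": "drink", "bigramFreq": 1, "pmi": 2.057362},
]

# Assumed stop words for this illustration only.
exampleStopWords = {"to", "now", ":-)"}

def selectBigrams(rows, minPMI, minBigramFreq, stopWords):
    """Keep rows that pass both thresholds and contain no stop word, sorted by PMI."""
    kept = [
        row for row in rows
        if row["bigramFreq"] >= minBigramFreq
        and row["pmi"] >= minPMI
        and row["word1"] not in stopWords
        and row["word2"] not in stopWords
    ]
    return sorted(kept, key=lambda row: row["pmi"], reverse=True)

for row in selectBigrams(rows, minPMI=2.0, minBigramFreq=1, stopWords=exampleStopWords):
    print((row["word1"], row["word2"]), row["bigramFreq"], row["pmi"])
```

Raising `minBigramFreq` above 1 would remove every row that survives here, which illustrates the trade-off between frequency and PMI.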

In [None]:
####### Write your code here #######

# Use this cell to explore the positiveTweets and check whether the bigrams you chose are really relevant

data = positiveTweets
data[90:150]

####################################

['I added a video to a @YouTube playlist http://t.co/HVVPhSYakA im back on twitch and today it going to be league :) - 1 / 3',
 '#FollowFriday @AmericanOGrain @PecomeP @APaulicand for being top supports in my community this week :)',
 '@ZaynZaynmalik30  follow @jnlazts &amp; http://t.co/RCvcYYO0Iq follow u back :)',
 "Gym Monday can't wait :). Likes",
 "@HarNiLiZaLouis Hey, here's your invite to join Scope as an influencer :)  http://t.co/rZgZtQ2fJT",
 'Those friends know themselves :)',
 'waiting for nudes :-)',
 '@JacobWhitesides go sleep u ! :)))))))))',
 'Stats for the day have arrived. 1 new follower and NO unfollowers :) via http://t.co/RB8pMNgMEo.',
 'My birthday is a week today! :D',
 "@metalgear_jp @Kojima_Hideo I want you're T-shirts ! They are so cool ! :D",
 '@AxeRade haw phela if am not looking like Mom obviously am looking like him :)',
 '@zaynmalik prince charming on stage :) x https://t.co/OnVFhzt5fZ',
 'i have really good luck :)',
 'Stats for the day have arrived. 1 n

In [None]:
####### Write your code here #######

# Use this cell to explore the DataFrame that contains the Bigrams present in the Positive Tweets

df = positiveBigrams
df[70:120]

####################################

|    | bigram | word1 | word2 | bigramFreq | word1Freq | word2Freq | bigramProb | word1Prob | word2Prob | pmi |
|---|---|---|---|---|---|---|---|---|---|---|
| 70 | (almost, views) | almost | views | 1 | 10 | 7 | 2.1e-05 | 0.000214 | 0.00015 | 6.504374 |
| 71 | (me, ;)) | me | ;) | 1 | 324 | 24 | 2.1e-05 | 0.006929 | 0.000513 | 1.794072 |
| 72 | (now, :-)) | now | :-) | 5 | 120 | 626 | 0.000107 | 0.002566 | 0.013387 | 1.135465 |
| 73 | (or, blue) | or | blue | 1 | 60 | 8 | 2.1e-05 | 0.001283 | 0.000171 | 4.579083 |
| 74 | (for, tonights) | for | tonights | 1 | 619 | 1 | 2.1e-05 | 0.013237 | 2.1e-05 | 4.324764 |
| 75 | (good, start) | good | start | 1 | 210 | 42 | 2.1e-05 | 0.004491 | 0.000898 | 1.668092 |
| 76 | (of, maple) | of | maple | 1 | 379 | 2 | 2.1e-05 | 0.008105 | 4.3e-05 | 4.122186 |
| 77 | (:-), nuf) | :-) | nuf | 1 | 626 | 1 | 2.1e-05 | 0.013387 | 2.1e-05 | 4.313519 |
| 78 | (south, korea) | south | korea | 1 | 2 | 3 | 2.1e-05 | 4.3e-05 | 6.4e-05 | 8.961109 |
| 79 | (to, drink) | to | drink | 1 | 996 | 6 | 2.1e-05 | 0.021299 | 0.000128 | 2.057362 |


In [None]:
#@title Analyze Bigrams from Negative Tweets
#@markdown ⚡ Run this cell to filter the different bigrams, specifying minimum values for PMI and Bigram Frequency.

# Interactive Controls
minPMI = FloatSlider(min=-2, max=11, step=0.5, value=-2,description='Min PMI',layout=Layout(width='500px'))
minBigramFreq = FloatSlider(min=1, max=150, step=1, value=1,description='Min Bigram Freq',layout=Layout(width='500px'))
WordsToRemove = ":(,and,at,a,an,follow,please" #@param {type:"string"}

# Computed once here instead of on every slider change
negativeBigrams = getBigramIndicators(negativeTweets)
negativeStopWords = WordsToRemove.replace(' ','').split(',')

@interact
def filterNegative(minPMI=minPMI, minBigramFreq=minBigramFreq):
  df = filterBigrams(negativeBigrams, minPMI, minBigramFreq)
  return df[df.apply(lambda data: filterStopWords(data, negativeStopWords), axis=1)]


In [None]:
####### Write your code here #######

# Use this cell to explore the negativeTweets and check whether the bigrams you chose are really relevant

data = negativeTweets
data[:10]

####################################

['hopeless for tmr :(',
 "Everything in the kids section of IKEA is so cute. Shame I'm nearly 19 in 2 months :(",
 '@Hegelbon That heart sliding into the waste basket. :(',
 '“@ketchBurning: I hate Japanese call him "bani" :( :(”\n\nMe too',
 'Dang starting next week I have "work" :(',
 "oh god, my babies' faces :( https://t.co/9fcwGvaki0",
 '@RileyMcDonough make me smile :((',
 '@f0ggstar @stuartthull work neighbour on motors. Asked why and he said hates the updates on search :( http://t.co/XvmTUikWln',
 'why?:("@tahuodyy: sialan:( https://t.co/Hv1i0xcrL2"',
 'Athabasca glacier was there in #1948 :-( #athabasca #glacier #jasper #jaspernationalpark #alberta #explorealberta #… http://t.co/dZZdqmf7Cz']

In [None]:
####### Write your code here #######

# Use this cell to explore the DataFrame that contains the Bigrams present in the Negative Tweets

df = negativeBigrams
df[350:380]

####################################

|    | bigram | word1 | word2 | bigramFreq | word1Freq | word2Freq | bigramProb | word1Prob | word2Prob | pmi |
|---|---|---|---|---|---|---|---|---|---|---|
| 350 | (travel, :-() | travel | :-( | 1 | 9 | 501 | 1.9e-05 | 0.000167 | 0.009286 | 2.482056 |
| 351 | (simpson, concert) | simpson | concert | 1 | 2 | 9 | 1.9e-05 | 3.7e-05 | 0.000167 | 8.005515 |
| 352 | (my, co-worker) | my | co-worker | 1 | 745 | 1 | 1.9e-05 | 0.013808 | 1.9e-05 | 4.282503 |
| 353 | (agessss, :() | agessss | :( | 1 | 1 | 4584 | 1.9e-05 | 1.9e-05 | 0.084963 | 2.46556 |
| 354 | (photo, taken) | photo | taken | 1 | 12 | 9 | 1.9e-05 | 0.000222 | 0.000167 | 6.213756 |
| 355 | (out, his) | out | his | 1 | 120 | 39 | 1.9e-05 | 0.002224 | 0.000723 | 2.444834 |
| 356 | (the, argument) | the | argument | 1 | 921 | 1 | 1.9e-05 | 0.01707 | 1.9e-05 | 4.070427 |
| 357 | (hav, phone) | hav | phone | 1 | 2 | 35 | 1.9e-05 | 3.7e-05 | 0.000649 | 6.647392 |
| 358 | (naomi, :() | naomi | :( | 1 | 1 | 4584 | 1.9e-05 | 1.9e-05 | 0.084963 | 2.46556 |
| 359 | (need, ouat) | need | ouat | 1 | 96 | 1 | 1.9e-05 | 0.001779 | 1.9e-05 | 6.331539 |


⁉️ **Question (30 Points)** 🧐

**Part 1:** Important Bigrams in Positive Tweets ⚠️  
Which are the 10 most important bigrams in positive tweets?
```
("good", "luck")
("can't","wait")
("final","design")
("happy","birthday")
("display","enabled")
("an","intelectual")
("south","korea")
("light","bulbs")
("kind","words")
("someone","cares")
```

**Part 2:** Obtaining your Positive Bigrams ⚠️  
How did you obtain this list of bigrams? Did you do some extra filtering?
Explain the process that you performed to get your results.
```
I began building my list of bigrams by removing some stop words and emoticons
from the list. Next, I examined the frequency of the bigrams and their
pointwise mutual information (PMI). I selected bigrams with a high PMI,
as this shows that the two words tend to occur together. In some cases the
bigram frequency was high and in others low, but the PMI stayed relatively high.
An example is ("an","intelectual"), where the bigram frequency was
one but the PMI was above eight.
```

**Part 3:** Important Bigrams in Negative Tweets ⚠️  
Which are the 10 most important bigrams in negative tweets?

```
("come","back")
("please","follow")
("goodbye","stage")
("last","night")
("ice","cream")
("power","station")
("ain't","leaving")
("artists","music")
("forgetting","you're")
("can't","sleep")
```

**Part 4:** Obtaining your Negative Bigrams ⚠️  
How did you obtain this list of bigrams? Did you do some extra filtering?
Explain the process that you performed to get your results.
```
Just like with the positive bigrams, I began building my list of bigrams
by removing some stop words and emoticons from the list. Next, I examined the
frequency of the bigrams and their pointwise mutual information (PMI).
I selected bigrams with a high PMI, as this shows that the two words tend to occur together.
In some cases the bigram frequency was high and in others low,
but the PMI stayed relatively high.

The bigrams in the negative tweets did not differ much from those in the positive ones.
The main differences I noticed were the emoticons, the syntax and the more informal
language.
```