# Natural Language Processing

In this homework assignment, you will tackle three distinct tasks involving text analysis:

1. Tweet Classification: Predict whether a specific tweet pertains to a natural disaster.
2. Bank Q&A Analysis: Identify the intent behind a customer's query from its text.
3. Fake News Classification: Determine whether a piece of news is true or false.

For educational reasons, we have retained all utility cells for data downloads. However, you can find everything you need in the corresponding GitHub repository subfolder.

# Solution Approach

Our solution approach will remain relatively consistent for all tasks:

1. Encode the text with a pretrained language model (LM), turning each piece of text into a vector.
2. Train a standard classification model (such as Logistic Regression or Random Forest) on these features.

Despite the varied nature of these tasks, you'll find that this approach provides a solid baseline solution for all three.
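
The two steps above can be sketched on toy data as follows. This is only an illustration under stated assumptions: scikit-learn's `HashingVectorizer` is used as a stand-in for the LM encoder (it is not a language model), and the texts and labels are made up.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data (hypothetical, for illustration only)
texts = [
    "forest fire spreading near the town",
    "flood warning issued for the river",
    "earthquake damages several buildings",
    "I love this new pizza place",
    "great concert last night",
    "my cat is sleeping on the sofa",
]
labels = np.array([1, 1, 1, 0, 0, 0])

# Step 1: text -> fixed-size vector (stand-in encoder, NOT a language model)
encoder = HashingVectorizer(n_features=64, alternate_sign=False)
X = encoder.transform(texts).toarray()

# Step 2: a standard classifier trained on top of the vectors
clf = LogisticRegression(max_iter=1000).fit(X, labels)
train_pred = clf.predict(X)
print(X.shape, train_pred)
```

In the tasks below, step 1 is replaced by real LM embeddings, while step 2 stays essentially the same.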

In [None]:
# !pip install kaggle

In [None]:
# !kaggle competitions download -c nlp-getting-started

In [None]:
# !unzip nlp-getting-started.zip

# Task 1. Tweets

https://www.kaggle.com/competitions/nlp-getting-started


1. What is this dataset about?
2. Encode the text with an LM. What is the dimensionality of the resulting embeddings?
3. Plot t-SNE and describe the graph (run t-SNE on 128 PCA components).
4. Run LR with 5-fold cross-validation. Which metrics are appropriate for this task?
5. Comment on model performance.
6. Explore the results (use out-of-fold predictions): find 3 False Positive tweets that do not really look like a disaster, e.g.:
- https://twitter.com/shauniefish/status/649148030290006017 `I just checked in! \x89ÛÒ at On Fire on @ZomatoAUS #LoveFood http://t.co/9l5kqykrbG`

## 1.1 What is this dataset about?

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('./data/tweets/train.csv', index_col=0) # Use a direct link to the GitHub file if you are working in Colab

In [None]:
df.head(3)

## 1.2 Encode the text with an LM. What is the dimensionality of the resulting embeddings?

In [None]:
# !pip install langchain

In [None]:
# !pip install sentence-transformers

In [None]:
from langchain.embeddings import HuggingFaceEmbeddings
from tqdm.notebook import tqdm

In [None]:
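embeddings = HuggingFaceEmbeddings()  # uses the library's default sentence-transformers model
tweets_embeddings = []

# Encode each tweet into a fixed-size vector; iterating over the column
# lets tqdm infer the total and show real progress
for text in tqdm(df.text):
    vec = embeddings.embed_query(text)
    tweets_embeddings.append(vec)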
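
One way to read off the dimensionality is to stack the vectors into an array and inspect its shape. The sketch below uses a small made-up list, `toy_embeddings`, as a stand-in for `tweets_embeddings`; the real numbers depend on the model that was loaded.

```python
import numpy as np

# Hypothetical stand-in for `tweets_embeddings`: 5 texts, 4-dimensional vectors
toy_embeddings = [[0.1, 0.2, 0.3, 0.4] for _ in range(5)]

# Stack the list of vectors into a (n_texts, dim) matrix
X = np.asarray(toy_embeddings, dtype=np.float32)
n_texts, dim = X.shape
print(n_texts, dim)
```

The same two lines applied to `tweets_embeddings` give the matrix used as features in the following sections.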

## 1.3 Plot t-SNE and describe the graph (run t-SNE on 128 PCA components).

In [None]:
import numpy as np
from openTSNE import TSNE
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# TODO

pca = PCA(128)
X_pca = ...
tsne_embedding = ...
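
For orientation only, here is the PCA-then-t-SNE idea on synthetic data. Note the assumptions: it uses scikit-learn's `TSNE` (imported as `SklearnTSNE`) instead of openTSNE, whose API differs, and the two Gaussian blobs stand in for real embeddings. It is not meant as the solution to the TODO above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE as SklearnTSNE  # stand-in for openTSNE

rng = np.random.default_rng(0)

# Synthetic "embeddings": two Gaussian blobs, 100 points each, 256 dimensions
X_toy = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 256)),
    rng.normal(3.0, 1.0, size=(100, 256)),
])
y_toy = np.array([0] * 100 + [1] * 100)

# Reduce to 128 principal components first, then project to 2-D with t-SNE
X_toy_pca = PCA(n_components=128, random_state=0).fit_transform(X_toy)
toy_2d = SklearnTSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_toy_pca)
print(X_toy_pca.shape, toy_2d.shape)
```

The 2-D points can then be drawn with `plt.scatter(toy_2d[:, 0], toy_2d[:, 1], c=y_toy)` to see how well the classes separate.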

## 1.4 Run LR with 5-fold cross-validation. Which metrics are appropriate for this task?

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, roc_auc_score
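
A minimal sketch of out-of-fold (OOF) predictions with `cross_val_predict`, assuming a synthetic binary dataset from `make_classification` in place of the tweet embeddings. The metrics printed here are examples; deciding which of them suits the task is left to you.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic binary problem standing in for (embeddings, target)
X_toy, y_toy = make_classification(n_samples=200, n_features=20, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Each row is predicted by a model that never saw it during training
oof_proba = cross_val_predict(clf, X_toy, y_toy, cv=cv, method="predict_proba")[:, 1]
oof_pred = (oof_proba >= 0.5).astype(int)

print("accuracy:", accuracy_score(y_toy, oof_pred))
print("F1:", f1_score(y_toy, oof_pred))
print("ROC AUC:", roc_auc_score(y_toy, oof_proba))
```

Because every prediction comes from a fold in which the row was held out, these scores estimate performance on unseen data, and the same `oof_pred` vector can be reused for error analysis.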

## 1.5 Comment on model performance

## 1.6 Explore the results (use out-of-fold predictions): find 3 False Positive tweets that do not really look like a disaster, e.g.:

    https://twitter.com/shauniefish/status/649148030290006017 I just checked in! \x89ÛÒ at On Fire on @ZomatoAUS #LoveFood http://t.co/9l5kqykrbG


In [None]:
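# y_pred: out-of-fold predictions obtained in section 1.4
# Print every misclassified tweet with its true label and the prediction
for i, (true, prediction) in enumerate(zip(df.target, y_pred)):
    if true != prediction:
        print('true:', true)
        print('predicted:', prediction)
        print(df.iloc[i].text)
        print('======')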
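
To narrow the errors down to false positives only, a boolean mask is usually enough. The sketch assumes the standard definition (label 0, prediction 1) and uses a made-up frame with a hypothetical `oof_pred` column.

```python
import pandas as pd

# Toy frame (hypothetical): `target` is the label, `oof_pred` the out-of-fold prediction
toy = pd.DataFrame({
    "text": ["house on fire", "this mixtape is fire", "lovely sunset", "flood in the city"],
    "target": [1, 0, 0, 1],
    "oof_pred": [1, 1, 0, 0],
})

# False positive: predicted 1 (disaster) while the label is 0
false_positives = toy[(toy.target == 0) & (toy.oof_pred == 1)]
print(false_positives.text.tolist())
```

Swapping the two conditions gives the false negatives instead.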

# Task 2. Bank Customers' Q&A system

https://huggingface.co/datasets/PolyAI/banking77


1. What is this dataset about?
2. What are the minimum and maximum median text lengths across classes (e.g., the median text length for `atm_support` is 35)?
3. Encode the text with an LM.
4. Run RF with 5-fold cross-validation. Which metrics are appropriate for this task?
5. Comment on model performance.
6. Analyze the errors of your model (use out-of-fold predictions). Which two classes does your model confuse most often?
7. (optional) Plot a t-SNE graph with all observations, but color only the two classes from the previous question. Make the other points light gray. Comment on the graph.

## 2.1 What is this dataset about?

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/PolyAI-LDN/task-specific-datasets/master/banking_data/train.csv')

In [None]:
df.head(3)

## 2.2 What are the minimum and maximum median text lengths across classes?
E.g., the median text length for `atm_support` is 35.
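
A possible way to compute per-class medians with pandas, shown on a made-up frame. It assumes that "text length" means the number of characters; the column names `text` and `category` follow the dataset loaded above.

```python
import pandas as pd

# Toy data (hypothetical categories and texts)
toy = pd.DataFrame({
    "text": ["hi", "hello there", "card lost", "where is my card", "help"],
    "category": ["greeting", "greeting", "card", "card", "card"],
})

# Character length of each text, then the median per class
median_len = toy.text.str.len().groupby(toy.category).median()

print(median_len.idxmin(), median_len.min())
print(median_len.idxmax(), median_len.max())
```

`idxmin` and `idxmax` return the class names, so the same two lines answer both halves of the question.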

## 2.3 Encode text with LM

## 2.4 Run RF with 5-fold cross-validation. Which metrics are appropriate for this task?

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
# TODO

clf = RandomForestClassifier(n_estimators=10)
y_pred = ...

## 2.5 Comment on model performance

How many classes are in this dataset?

## 2.6 Analyze the errors of your model (use out-of-fold predictions). Which two classes are confused by your model the most?

In [None]:
from collections import Counter

# Count misclassifications per unordered pair of classes (true, predicted)
errors = Counter()

for i, (true, predicted) in enumerate(zip(df.category, y_pred)):
    if true != predicted:
        key = '-'.join(sorted([true, predicted]))
        errors[key] += 1
        print(true)
        print(predicted)
        print(df.iloc[i].text)
        print('======')

In [None]:
sorted(errors.items(), key=lambda x: x[1], reverse=True)
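
An alternative view of the same question is a confusion matrix, which also keeps the direction of each error. The sketch below runs on made-up labels and uses `sklearn.metrics.confusion_matrix`; the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (hypothetical)
y_true = ["a", "a", "a", "b", "b", "c", "c", "c"]
y_hat = ["a", "b", "b", "b", "a", "c", "c", "a"]
labels = ["a", "b", "c"]

cm = confusion_matrix(y_true, y_hat, labels=labels)  # rows: true, columns: predicted
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)  # keep only the errors

# Symmetrize so that a->b and b->a count toward the same pair
pair_errors = off_diag + off_diag.T
i, j = np.unravel_index(np.argmax(pair_errors), pair_errors.shape)
print(labels[i], labels[j], pair_errors[i, j])
```

Inspecting `cm[i, j]` and `cm[j, i]` separately shows whether the confusion is mostly one-directional.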

## 2.7 (optional) Plot a t-SNE graph with all observations, but color only the two classes from the previous question. Make the other points light gray. Comment on the graph and analyze the model's errors.

# Task 3. Fake news

https://www.kaggle.com/datasets/jainpooja/fake-news-detection

1. What is this dataset about?
2. How many unique subjects are in True news and Fake news?
3. Encode text with LM
4. Run LR with 5-fold cross-validation. Which metrics are appropriate for this task?
5. Comment on model performance. Would you prefer a model with high Recall or with high Precision?
6. (optional) Analyze class distribution of the model. How many articles mentioning "Trump" are Fake? How many articles not mentioning "Trump" are Fake? Same question for "Obama". Can you say that this dataset is biased? Explain.
7. (optional) Using your model, find False Positives that are actually True statements (news).


In [None]:
# !kaggle datasets download -d jainpooja/fake-news-detection

In [None]:
# !unzip fake-news-detection.zip

## 3.1 What is this dataset about?

In [None]:
df_fake = pd.read_csv('./data/fake-news/Fake.csv') # Use a direct link to the GitHub file if you are working in Colab
df_true = pd.read_csv('./data/fake-news/True.csv')
# ignore_index=True gives the combined frame a unique 0..n-1 index
df = pd.concat([df_fake, df_true], ignore_index=True)
df['target'] = [1]*df_fake.shape[0]+[0]*df_true.shape[0] # 1 if Fake, 0 otherwise

## 3.2 How many unique subjects are in True news and Fake news?
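
One way to count distinct subjects per class, and to check whether the two classes share any, is sketched below on a made-up frame. The `subject` values are placeholders; only the column names are taken from the dataset.

```python
import pandas as pd

# Toy data (hypothetical subject names)
toy = pd.DataFrame({
    "subject": ["A", "B", "A", "B", "C", "C"],
    "target": [0, 0, 0, 1, 1, 1],  # 1 if Fake, 0 otherwise
})

# Number of distinct subjects within each class
n_subjects = toy.groupby("target")["subject"].nunique()

# Subjects that occur in both classes
subject_sets = toy.groupby("target")["subject"].apply(set)
shared = subject_sets.loc[0] & subject_sets.loc[1]
print(n_subjects.to_dict(), shared)
```

If `shared` turns out to be empty on the real data, the subject alone would separate the classes, which is worth keeping in mind when judging model performance.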

## 3.3 Encode text with LM.

Due to the dataset size, this will take about 15 minutes on an average PC.

## 3.4 Run LR with 5-fold cross-validation. Which metrics are appropriate for this task?

## 3.5 Comment on model performance. Would you prefer a model with high Recall or with high Precision?

## 3.6 (optional) Analyze class distribution of the model. 

How many articles mentioning "Trump" are Fake? How many articles not mentioning "Trump" are Fake? Same question for "Obama". Can you say that this dataset is biased? Explain.

In [None]:
df['is_trump'] = df['text'].apply(lambda x: 'Trump' in x)
df['is_obama'] = df['text'].apply(lambda x: 'Obama' in x)
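
With such a flag in place, a grouped aggregation gives the counts and the share of Fake articles for each group. The sketch uses a made-up frame, and the output column names (`n_fake`, `n_total`, `fake_share`) are illustrative.

```python
import pandas as pd

# Toy data (hypothetical texts)
toy = pd.DataFrame({
    "text": ["Trump said something", "Trump again", "weather today", "markets fall", "Trump rally"],
    "target": [1, 1, 0, 1, 0],  # 1 if Fake, 0 otherwise
})
toy["is_trump"] = toy["text"].str.contains("Trump", regex=False)

# Per group: number of Fake articles, group size, and share of Fake
summary = toy.groupby("is_trump")["target"].agg(
    n_fake="sum", n_total="count", fake_share="mean"
)
print(summary)
```

Comparing `fake_share` between the two rows is a quick first check for the kind of bias the question asks about.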

## 3.7 (optional) Using your model, find False Positives (news items that have the target "Fake" but are actually True).