# **Natural Language Processing**
## *Practice 6 - Supervised Learning*

## Objectives

* To train supervised learning classifiers and apply them to a document collection using the scikit-learn Python library.


### Supervised learning

Supervised learning makes predictions about new data based on patterns or features learned from labeled historical data.

A label encodes what we already know about a data item, such as the class it belongs to.


#### 1. How it works

To understand how supervised learning works, consider a dataset labeled a priori with four classes: 1 for apples, 2 for citrus fruits, 3 for watermelons and 4 for bananas. We extract the features of this dataset as vectors (TF-IDF or another representation) and use them to train the machine learning algorithm.

The algorithm produces a model. Given new data, the model classifies items it has never seen before using what it learned in the training phase.

<img src="https://www.diegocalvo.es/wp-content/uploads/2019/03/aprendizaje_supervisado.png" width="800" height="500" />




<img src="https://www.diegocalvo.es/wp-content/uploads/2019/03/clasificaci%C3%B3n_de_aprendizaje_supervisado.png" width="400" height="300" />

#### 2. Regression

Regression aims to predict continuous values from labeled historical data.

An example of this method is the estimation of a client's spending as a function of income and number of children.

Types of regression algorithms include:
* Linear regression
* Polynomial regression
* Support vector regression
* Decision trees regression
* Random forest regression
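
As a minimal sketch of the spending example above, the following fits a linear regression with scikit-learn. The income, children and spending figures are invented for illustration (the spending values are generated from a made-up linear rule), so only the workflow is meaningful, not the numbers.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: [monthly income, number of children]
X_train = np.array([
    [1500, 0],
    [2000, 1],
    [2500, 2],
    [3000, 1],
    [3500, 3],
    [4000, 2],
])

# Hypothetical labels, generated as 0.5 * income + 100 * children
y_train = 0.5 * X_train[:, 0] + 100 * X_train[:, 1]

# Train the regression model on the labeled data
reg = LinearRegression()
reg.fit(X_train, y_train)

# Predict a continuous value for a client not seen during training
new_client = np.array([[2800, 2]])
predicted_spending = reg.predict(new_client)[0]
print(round(predicted_spending, 2))
```

Unlike a classifier, the model returns a number on a continuous scale rather than one of a fixed set of classes.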

#### 3. Classification

Classification aims to assign items to groups (classes) based on labeled historical data.

An example is classifying fruits according to color, shape, texture and so on.


<img src="https://www.diegocalvo.es/wp-content/uploads/2019/03/clasificaci%C3%B3n.png" width="280" height="200" />



Types of classification algorithms include:
* Logistic regression.
* Nearest neighbors.
* Support vector machines.
* Classification decision trees.
* Random forest classification.
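
The fruit example can be sketched with a nearest-neighbors classifier. The two numeric features (weight and length/width ratio) and all the values below are assumptions chosen for illustration; they are not part of the practice material.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical features: [weight in grams, length/width ratio]
X_train = [
    [150, 1.0], [170, 1.1],      # apples
    [120, 4.0], [130, 4.5],      # bananas
    [5000, 1.3], [6000, 1.2],    # watermelons
]
y_train = ["apple", "apple", "banana", "banana", "watermelon", "watermelon"]

# Each new fruit receives the label of its single closest training example
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X_train, y_train)

new_fruits = [[160, 1.05], [125, 4.2], [5500, 1.25]]
predictions = clf.predict(new_fruits)
print(list(predictions))
```

In this toy data the weight dominates the distance because the features are not scaled; with real features, scaling them first is usually advisable.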

#### Example

Example of Support Vector Machines (SVM) in Python:

1) Import the necessary libraries:


In [1]:
from sklearn import svm

2) We add the texts in a Python list:

In [2]:
documents = ["This little kitty came to play when I was eating at a restaurant.",
             "Merley has the best squooshy kitten belly.",
             "Google Translate app is incredible.",
             "If you open 100 tab in google you get a smiley face.",
             "Best cat photo I've ever taken.",
             "Climbing ninja cat.",
             "Impressed with google map feedback.",
             "Key promoter extension for Google Chrome."]


3) We add the tag (class) of each document. The order matters: each tag must be at the same position as its document:


In [3]:
tags = ["cat", "cat", "google", "google", "cat", "cat", "google", "google"]

This way, the document in the first position, documents[0], has the tag in the first position, tags[0], and so on.

4) We extract the features using TF-IDF:

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)

5) We create and train the SVM model with the default parameters. Note that training requires the tag of each document:


In [5]:
model = svm.SVC()
model.fit(X, tags)

6) Predict which group (cat or google) would be appropriate for our new text:


In [6]:
text = ["My chrome browser to open."]
Y = vectorizer.transform(text)
prediction = model.predict(Y)
print(text, prediction)
text = ["Drago's cat is cute."]
Y = vectorizer.transform(text)
prediction = model.predict(Y)
print(text, prediction)

['My chrome browser to open.'] ['google']
["Drago's cat is cute."] ['cat']
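
As an optional variation on the steps above, the vectorizer and the classifier can be chained with scikit-learn's `make_pipeline`, so that new texts are vectorized and classified in a single call. This is a sketch of an alternative arrangement, not a required part of the practice; the name `text_clf` is chosen here for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

documents = ["This little kitty came to play when I was eating at a restaurant.",
             "Merley has the best squooshy kitten belly.",
             "Google Translate app is incredible.",
             "If you open 100 tab in google you get a smiley face.",
             "Best cat photo I've ever taken.",
             "Climbing ninja cat.",
             "Impressed with google map feedback.",
             "Key promoter extension for Google Chrome."]
tags = ["cat", "cat", "google", "google", "cat", "cat", "google", "google"]

# fit() runs fit_transform on the vectorizer and then trains the SVM;
# predict() runs transform on the vectorizer and then classifies
text_clf = make_pipeline(TfidfVectorizer(), SVC())
text_clf.fit(documents, tags)

new_texts = ["My chrome browser to open.", "Drago's cat is cute."]
predictions = text_clf.predict(new_texts)
for text, label in zip(new_texts, predictions):
    print(text, "->", label)
```

Keeping both steps in one object makes it harder to forget the `transform` call on new texts.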


## Tutorials

SVM Tutorial: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

Tf-Idf Tutorial: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

## Exercises

The result of this practice must be submitted in PLATEA by **23:59 on March 31, 2025**. Submit this same notebook, with the *.ipynb* extension, renamed as follows: pr7_user1_user2.ipynb. Replace "user1" and "user2" with your email aliases.

### Exercise 1

Use the **Gender** corpus available in the "Material Complementario" folder of Docencia Virtual to create a supervised model.

You may use any classification method you like.

The Gender corpus has two labels: F (female) and M (male). We will try to predict whether a text was written by a woman or a man.

To do this, process the text of each document in the collection:
* Tokenize the text
* Remove the stop words
* Reduce words to their root
* Change the text to lowercase

Then apply the TF-IDF method seen in class to the processed documents and train the model you have chosen.

Predict the gender of the texts in the files predic_gender1.txt, predic_gender2.txt and predic_gender3.txt. You will find these files in the "Material Complementario" folder of PLATEA.

**NOTE**: it is important to process the texts we want to predict in the same way as we have processed all the documents in the collection so that the model can find the same words.
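
One way to guarantee this is to put all preprocessing in a single function and call it on both the training documents and the texts to predict. The sketch below assumes a helper named `preprocess` and two invented sentences; it uses scikit-learn's built-in `ENGLISH_STOP_WORDS` as a stand-in for the NLTK stop word list, only to avoid a download.

```python
import re
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

stemmer = PorterStemmer()

def preprocess(text):
    """Lowercase, tokenize, drop stop words and stem; returns one string."""
    tokens = re.findall(r"\w+", text.lower())
    tokens = [t for t in tokens if t not in ENGLISH_STOP_WORDS]
    return " ".join(stemmer.stem(t) for t in tokens)

# Hypothetical texts used only to illustrate the idea
train_texts = ["The cats were playing in the garden.", "Google released new maps."]
new_text = "A cat plays with the map."

# The same function is applied to training and new texts
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform([preprocess(t) for t in train_texts])
X_new = vectorizer.transform([preprocess(new_text)])

print(preprocess(new_text))
print(X_new.nnz)  # number of vocabulary terms found in the new text
```

Without stemming, "plays" and "cats" would not match any column learned from the training texts and would be silently ignored.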


### Text preprocessing and model training

In [8]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [13]:
# Location of the Gender corpus in the mounted Drive folder
path = "/content/drive/MyDrive/NLP/gender/gender_corpus.xlsx"

In [14]:
import pandas as pd

In [18]:
df = pd.read_excel(path, usecols=[0, 1], header=None)

In [20]:
print(df.shape)

(3229, 2)


In [26]:
df['text_lower'] = df[0].apply(lambda x: str(x).lower())
df['label_lower'] = df[1].apply(lambda x: str(x).lower())

In [27]:
df.head()

| | 0 | 1 | text_lower | label_lower |
|---|---|---|---|---|
| 0 | Long time no see. Like always I was rewriting... | M | long time no see. like always i was rewriting... | m |
| 1 | Guest Demo: Eric Iverson’s Itty Bitty Search\... | M | guest demo: eric iverson’s itty bitty search\... | m |
| 2 | Who moved my Cheese??? The world has been de... | M | who moved my cheese??? the world has been de... | m |
| 3 | Yesterday I attended a biweekly meeting of an... | M | yesterday i attended a biweekly meeting of an... | m |
| 4 | Liam is nothing like Natalie. Natalie never w... | F | liam is nothing like natalie. natalie never w... | f |


In [28]:
df['label_lower'].value_counts()

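| label_lower | count |
|---|---|
| m | 1551 |
| f | 1392 |
| f | 153 |
| m | 126 |
| | 5 |
| f | 1 |
| m | 1 |

The repeated `m` and `f` rows are distinct values that differ only in surrounding whitespace, which is not visible in this output.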


In [31]:
df['label_clean'] = df['label_lower'].str.strip()
df['label_clean'] = df['label_clean'].replace('', pd.NA)

In [32]:
df['label_clean'].value_counts()

| label_clean | count |
|---|---|
| m | 1678 |
| f | 1546 |
| | 5 |
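
The effect of this cleaning step can be reproduced on a handful of made-up labels. The values in `labels` are hypothetical and only mimic the kind of inconsistencies found above (mixed case, stray spaces, blank cells).

```python
import pandas as pd

# Hypothetical raw labels with inconsistent case and stray whitespace
labels = pd.Series(["M", "f", " F", "m ", " ", "F", "M\n"])

# Strip whitespace, lowercase, and turn blank labels into missing values
clean = labels.str.strip().str.lower().replace("", pd.NA)

counts = clean.value_counts()  # missing values are excluded by default
n_missing = clean.isna().sum()
print(counts)
print("missing:", n_missing)
```

After cleaning, the variants collapse into the two expected classes and the blank label can be dropped with `dropna`.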


In [34]:
df = df.dropna(subset=['label_clean'])
df = df.dropna(subset=['text_lower'])

In [52]:
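import re

def tokenization(text):
    # Split on runs of non-word characters (raw string avoids an invalid escape)
    # and discard the empty strings that appear at the edges of the text
    tokens = [t for t in re.split(r'\W+', text) if t]
    return tokens

df['text_tokenized'] = df['text_lower'].apply(tokenization)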

In [53]:
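import nltk
nltk.download('stopwords')
# A set gives fast lookups; the name stop_words avoids clashing with nltk.corpus.stopwords
stop_words = set(nltk.corpus.stopwords.words('english'))

def remove_stopwords(tokens):
    # Keep only the tokens that are not English stop words
    return [t for t in tokens if t not in stop_words]

df['text_no_stopwords'] = df['text_tokenized'].apply(remove_stopwords)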

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [54]:
from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()

def stemming(text):
  stem_text = [porter_stemmer.stem(word) for word in text]
  return stem_text

df['text_stemmed'] = df['text_no_stopwords'].apply(stemming)

In [55]:
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

def lemmatizer(text):
  lemm_text = [wordnet_lemmatizer.lemmatize(word) for word in text]
  return lemm_text

df['text_lemmatized'] = df['text_stemmed'].apply(lemmatizer)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [56]:
df['text_processed'] = df['text_lemmatized'].apply(' '.join)

In [57]:
documents = df['text_processed'].tolist()
tags = df['label_clean'].tolist()

In [59]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)

In [60]:
model = svm.SVC()
model.fit(X, tags)
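
Before predicting on new files, it can be useful to estimate how well the classifier generalizes by holding out part of the labeled data. The sketch below illustrates the idea on the small cat/google example rather than on the Gender corpus; the split size and `random_state` are arbitrary choices, and with so few documents the resulting score is not meaningful in itself.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

documents = ["This little kitty came to play when I was eating at a restaurant.",
             "Merley has the best squooshy kitten belly.",
             "Google Translate app is incredible.",
             "If you open 100 tab in google you get a smiley face.",
             "Best cat photo I've ever taken.",
             "Climbing ninja cat.",
             "Impressed with google map feedback.",
             "Key promoter extension for Google Chrome."]
tags = ["cat", "cat", "google", "google", "cat", "cat", "google", "google"]

# Hold out 25% of the documents, keeping the class proportions
docs_train, docs_test, tags_train, tags_test = train_test_split(
    documents, tags, test_size=0.25, stratify=tags, random_state=0
)

# Fit the vectorizer on the training part only, then reuse it on the test part
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(docs_train)
X_test = vectorizer.transform(docs_test)

model = SVC()
model.fit(X_train, tags_train)

accuracy = accuracy_score(tags_test, model.predict(X_test))
print(f"Hold-out accuracy: {accuracy:.2f}")
```

The same pattern applies to the Gender corpus by replacing `documents` and `tags` with the processed texts and cleaned labels.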


### Prediction of labels

In [68]:
file_paths = [
    '/content/drive/MyDrive/NLP/gender/predic_gender1.txt',
    '/content/drive/MyDrive/NLP/gender/predic_gender2.txt',
    '/content/drive/MyDrive/NLP/gender/predic_gender3.txt'
]

In [74]:
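import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

def preprocess_text(text):
    # Mirror the training pipeline: lowercase, tokenize, remove stop words, stem, lemmatize
    tokens = re.findall(r'\w+', text.lower())
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(word) for word in tokens]
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return ' '.join(tokens)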

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [78]:
import os

results = []
for file_path in file_paths:
    # Read the raw text of the file to classify
    with open(file_path, 'r') as f:
        text = f.read()

    # Preprocess the text before vectorizing it
    processed_text = preprocess_text(text)

    # Reuse the already fitted vectorizer (transform, not fit_transform)
    Y = vectorizer.transform([processed_text])
    prediction = model.predict(Y)

    results.append({
        'file': os.path.basename(file_path),
        'prediction': prediction
    })

In [80]:
result_df = pd.DataFrame(results)
print(result_df[['file', 'prediction']])

                 file prediction
0  predic_gender1.txt        [m]
1  predic_gender2.txt        [f]
2  predic_gender3.txt        [m]
