# **Machine Learning in Python**
Machine learning (ML) and artificial intelligence (AI) are now central to language processing. High-resource languages such as English and Spanish often have enormous ML models trained on billions of words. However, we can also use machine learning at a smaller scale for low-resource languages.

This lesson will give a brief overview of using a machine learning framework in Python, but it won't go into a great amount of depth. To learn more about using machine learning for language processing, we recommend the book [Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/) by Jurafsky and Martin.

## **Sentiment Analysis**
The task we will be solving with an ML model is **sentiment analysis**. Sentiment analysis aims to predict whether a piece of text expresses a positive, negative, or neutral sentiment about its topic. We will use a [dataset of tweets from Kaggle](https://www.kaggle.com/datasets/yasserh/twitter-tweets-sentiment-dataset?resource=download). For instance, the following tweet is classified as positive:

> Screw the reviews, I thought Wolverine was awesome. But not enough Dominic Monaghan for my liking.

But this tweet is negative:

> THIS twitter is driving me nuts...WONT LET ME DOWNLOAD A PROFILE PIC!! ...guess i`ll keep trying!!

## **Loading Data**
First, we need to load our dataset and prepare it for use in a model. Since the data is in CSV format, we can use Python's built-in `csv` module to parse it.

In [None]:
import csv

all_tweets = []
all_sentiments = []

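# newline='' is what the csv module recommends, so line breaks inside quoted fields parse correctly
with open('./Tweets.csv', newline='') as tweets_file: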
    # Create a csv parser
    csv_reader = csv.reader(tweets_file)
    
    # Skip the first row, the headers
    next(csv_reader, None)
    
    for row in csv_reader:
        # The second column has the tweet text
        all_tweets.append(row[1])
        
        # The fourth column has the sentiment label
        all_sentiments.append(row[3])
        
print(all_tweets[:10])
print(all_sentiments[:10])
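
Reading columns by position works, but it breaks silently if the column order changes. As a rough alternative, `csv.DictReader` reads columns by header name. The sketch below runs on a tiny in-memory sample rather than the real file, and the header names `text` and `sentiment` are assumptions that should be checked against the first row of `Tweets.csv`.

```python
import csv
import io

# Hypothetical two-row sample; the header names below are assumed, not
# taken from the lesson, so check them against the real Tweets.csv.
sample_csv = io.StringIO(
    "textID,text,selected_text,sentiment\n"
    'a1,"I love this, truly",love,positive\n'
    "b2,So tired today,tired,negative\n"
)

sample_tweets = []
sample_sentiments = []

# DictReader consumes the header row itself and yields one dict per row
for row in csv.DictReader(sample_csv):
    sample_tweets.append(row["text"])
    sample_sentiments.append(row["sentiment"])

print(sample_tweets)
print(sample_sentiments)
```

Note that the comma inside the quoted first tweet does not split the field, which is the main reason to use a CSV parser instead of `str.split(',')`.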

We will also replace each sentiment with a label: 0 for neutral, 1 for positive, and 2 for negative.

In [None]:
all_sentiments_encoded = []
for sentiment in all_sentiments:
    if sentiment == 'neutral':
        all_sentiments_encoded.append(0)
    elif sentiment == 'positive':
        all_sentiments_encoded.append(1)
    elif sentiment == 'negative':
        all_sentiments_encoded.append(2)
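    else:
        # Stop with an error so the tweets and labels cannot silently fall out of step
        raise ValueError(f"Unexpected label found: {sentiment!r}")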
        
print(all_sentiments_encoded[:10])
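
The same encoding can also be written as a dictionary lookup, which may be easier to extend when there are more labels. This is only a sketch of an equivalent approach; the names `LABEL_TO_ID` and `encode_sentiments` are made up for illustration.

```python
# Same encoding as in the lesson: 0 = neutral, 1 = positive, 2 = negative
LABEL_TO_ID = {"neutral": 0, "positive": 1, "negative": 2}

def encode_sentiments(sentiments):
    """Map each sentiment string to its integer label.

    Raises KeyError if a label is not in LABEL_TO_ID.
    """
    return [LABEL_TO_ID[sentiment] for sentiment in sentiments]

# Made-up labels, for illustration only
print(encode_sentiments(["neutral", "positive", "negative", "positive"]))  # [0, 1, 2, 1]
```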

### Training/testing split
In machine learning, it is standard practice to divide data into two datasets: one for training the model, and one for testing the model's performance afterward. This helps to evaluate the model fairly and to detect *overfitting*, where the model only works well on the training data.

We will use the `train_test_split` function from the scikit-learn package (imported as `sklearn`).

In [None]:
from sklearn.model_selection import train_test_split

train_sentences, test_sentences, train_labels, test_labels = train_test_split(all_tweets, all_sentiments_encoded, test_size=0.3)

print(len(train_sentences), "training sentences")
print(len(test_sentences), "testing sentences")
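
To make the idea of a split concrete, here is a rough pure-Python sketch of shuffling paired data and slicing off a test portion. It is not how scikit-learn implements `train_test_split`; the function name `simple_train_test_split`, the `seed` parameter, and the toy data are all made up for illustration.

```python
import random

def simple_train_test_split(items, labels, test_size=0.3, seed=0):
    """Shuffle paired items/labels and split off a test fraction."""
    if len(items) != len(labels):
        raise ValueError("items and labels must have the same length")

    # Shuffle indices rather than the data so each item stays with its label
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)

    n_test = round(len(items) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    train_items = [items[i] for i in train_idx]
    test_items = [items[i] for i in test_idx]
    train_labels = [labels[i] for i in train_idx]
    test_labels = [labels[i] for i in test_idx]
    return train_items, test_items, train_labels, test_labels

# Made-up data: ten "tweets" whose label is the tweet number modulo 3
toy_tweets = [f"tweet {i}" for i in range(10)]
toy_labels = [i % 3 for i in range(10)]

tr_x, te_x, tr_y, te_y = simple_train_test_split(toy_tweets, toy_labels)
print(len(tr_x), "training sentences")
print(len(te_x), "testing sentences")
```

Fixing the seed makes the split repeatable, which is useful when comparing models; `train_test_split` offers a `random_state` argument for the same purpose.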

## **Creating a Model**
Now we are ready to create our model. We will use [Keras](https://keras.io), a popular high-level framework for machine learning. Compared with lower-level frameworks, Keras is easier to set up but gives less fine-grained control, which makes it a good choice for this lesson.

### Vectorization
First, our model will convert each sentence into a vector of binary values, where each position represents the occurrence of a word. For instance, if the vector for a sentence starts with `[1, 0, ...]` and the words are `[bad, good, ...]`, then the vector indicates that the word `bad` occurs in the tweet but the word `good` does not.

For this, we use the Keras `TextVectorization` layer.

In [None]:
from tensorflow import keras

text_vectorizer = keras.layers.TextVectorization(output_mode='multi_hot', # Create a vector in the style we described
                                                 max_tokens=2500)         # Use only the 2500 most common words

# Train the vectorizer using the training dataset
text_vectorizer.adapt(train_sentences)

Now we can look at the 100 most common words in the training data.

In [None]:
print(text_vectorizer.get_vocabulary()[:100])

We can also use the vectorizer to encode a sentence and see what the result looks like:

In [None]:
text_vectorizer("I went to the store")
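
The binary encoding described in this section can be sketched in plain Python. This is a simplified illustration, not the layer's actual implementation: the real `TextVectorization` layer also strips punctuation by default and reserves a position for unknown words, both of which are omitted here. The function `multi_hot` and the four-word vocabulary are made up for the example.

```python
def multi_hot(sentence, vocabulary):
    """Return a 0/1 vector with a 1 wherever a vocabulary word occurs."""
    # Crude tokenization: lowercase and split on whitespace
    words = set(sentence.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

# Hypothetical four-word vocabulary, for illustration only
toy_vocabulary = ["bad", "good", "movie", "the"]

print(multi_hot("The movie was bad", toy_vocabulary))  # [1, 0, 1, 1]
```

The word `was` is not in the toy vocabulary, so it leaves no trace in the vector. The same thing happens to rare words when `max_tokens` limits the vocabulary size.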

### Hidden Layers
<div>
<img src="../../assets/mlp.png" width="500" style=" display: block; margin-left: auto; margin-right: auto;"/>
</div>

One of the key techniques used in machine learning is the incorporation of **hidden layers**. Each hidden layer applies a function to its inputs, using adjustable weights that determine the output. During training, the model repeatedly adjusts these weights so that its predictions move closer to the correct outputs.

Using more hidden layers allows for a model that can learn more complicated patterns. In this case, the weights will determine how much each word contributes to the final prediction.

In [None]:
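# The first argument sets how many nodes the layer has.
# The ReLU activation makes the layer non-linear; without an activation,
# a Dense layer is only a linear function of its inputs.
hidden_layer1 = keras.layers.Dense(100, activation='relu')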
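
To show what a dense layer computes, here is a small pure-Python sketch: each node takes a weighted sum of the inputs, adds a bias, and passes the result through an activation function. ReLU is used here as one common choice of activation and is an assumption of the sketch. The function names and all numbers are made up; Keras does the same arithmetic with optimized matrix operations.

```python
def relu(value):
    """Rectified linear unit: negative values become 0."""
    return max(0.0, value)

def dense_forward(inputs, weights, biases):
    """Compute one dense layer: relu(weighted sum of inputs + bias) per node.

    weights[j][i] is the weight from input i to node j.
    """
    outputs = []
    for node_weights, bias in zip(weights, biases):
        weighted_sum = sum(w * x for w, x in zip(node_weights, inputs))
        outputs.append(relu(weighted_sum + bias))
    return outputs

# Made-up numbers: 3 inputs (a multi-hot vector) feeding 2 nodes
toy_inputs = [1, 0, 1]
toy_weights = [[0.5, -1.0, 0.25],
               [-2.0, 0.5, 1.0]]
toy_biases = [0.0, 0.5]

print(dense_forward(toy_inputs, toy_weights, toy_biases))  # [0.75, 0.0]
```

Because the second input is 0, its weights have no effect on the output. In the same way, a word that is absent from a tweet contributes nothing to the layer.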

### Putting it together
Now, we can put together all the components of our model.

Every model must use a **loss function**, which measures how much error there is in its predictions. During training, the model attempts to minimize the loss and thus make better predictions. In this case, we use **crossentropy loss**, which measures how far the predicted class probabilities are from the true categorical label.

In [None]:
model = keras.Sequential()

# Our inputs will be strings
input_layer = keras.Input(shape=(1,), dtype='string')
model.add(input_layer)

# Add the vectorization layer
model.add(text_vectorizer)

# Add the hidden layers
model.add(hidden_layer1)

# Add the output layer
# Since we have 3 possible output classes, the layer should have three nodes
output_layer = keras.layers.Dense(3, activation='softmax')
model.add(output_layer)

# Compile the model
# We use `sparse_categorical_crossentropy` because the task has multiple
# classes and our labels are integers (0, 1, 2) rather than one-hot vectors
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

model.summary()
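
To make the output layer and the loss more concrete, the sketch below computes a softmax and the crossentropy loss for a single example in plain Python. It is a simplified illustration with made-up scores and helper names; Keras computes the same quantity in batches and averages it over the examples.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    # Subtracting the max avoids overflow without changing the result
    highest = max(scores)
    shifted = [math.exp(s - highest) for s in scores]
    total = sum(shifted)
    return [value / total for value in shifted]

def sparse_crossentropy(probabilities, true_label):
    """Loss for one example: -log of the probability given to the true class."""
    return -math.log(probabilities[true_label])

# Made-up scores for the classes [neutral, positive, negative]
toy_probabilities = softmax([1.0, 3.0, 0.5])

# The model favours class 1, so the loss is small if 1 is the true label
print(sparse_crossentropy(toy_probabilities, 1))
# The loss is larger if the true label is a class the model considered unlikely
print(sparse_crossentropy(toy_probabilities, 2))
```

The "sparse" in the loss name refers only to the label format: the true class is given as an integer index, as in our encoded labels, rather than as a one-hot vector.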

### Training
Now that we've built a model, the next step is training it. This may take some time, since training involves many large matrix operations.

In [None]:
history = model.fit(train_sentences, train_labels, epochs=7, verbose=True)

We can see that as our loss decreased, training accuracy increased. 
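
The object returned by `model.fit` keeps the per-epoch values in its `history` attribute, a dictionary of lists. Below is a small sketch for printing them side by side. It assumes the keys are `loss` and `accuracy`, which matches the metric requested when compiling but may differ across Keras versions. The helper name and the numbers are placeholders.

```python
def summarize_history(history_dict):
    """Return one formatted line per epoch from a Keras-style history dict."""
    lines = []
    epochs = zip(history_dict["loss"], history_dict["accuracy"])
    for epoch, (loss, accuracy) in enumerate(epochs, start=1):
        lines.append(f"epoch {epoch}: loss={loss:.3f} accuracy={accuracy:.3f}")
    return lines

# Placeholder numbers so the sketch runs on its own; they are NOT results
# from this lesson's model. With the real model, pass history.history instead.
placeholder_history = {"loss": [1.0, 0.8, 0.7], "accuracy": [0.50, 0.62, 0.70]}

for line in summarize_history(placeholder_history):
    print(line)
```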

### Evaluation
Now, let's evaluate the model on our test data.

In [None]:
model.evaluate(test_sentences, test_labels)

Our test accuracy was somewhat lower than our training accuracy, but still far better than guessing at random. Building models where the test performance isn't significantly worse than the training performance is a key goal in machine learning.
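
To judge whether an accuracy is good, it helps to compare it with a simple baseline. The sketch below computes accuracy by hand, along with the accuracy of always predicting the most frequent label. The labels and predictions are made up for illustration and are not output from the model above.

```python
from collections import Counter

def accuracy(predicted_labels, true_labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)

def majority_baseline(true_labels):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_count = Counter(true_labels).most_common(1)[0][1]
    return most_common_count / len(true_labels)

# Made-up labels and predictions, for illustration only
toy_true = [0, 1, 2, 0, 1, 0, 2, 0]
toy_predicted = [0, 1, 2, 0, 2, 0, 1, 1]

print(accuracy(toy_predicted, toy_true))  # 0.625
print(majority_baseline(toy_true))        # 0.5
```

With three equally frequent classes, random guessing would be right about one third of the time. If one class dominates the data, the majority baseline is the stricter comparison.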

Finally, let's see our model in action. We can use it to predict the sentiment of tweets that we make up.

In [None]:
import numpy as np

def predict_sentiment(tweet):
    # Our predictions will be a 3-element vector, where each element is the probability of a given sentiment class
    predictions = model.predict([tweet])[0]
    
    # Take the argmax to find the most likely sentiment
    predicted_sentiment_index = np.argmax(predictions)
    
    # Turn the sentiment index into a label
    # (the order must match the encoding: 0 = neutral, 1 = positive, 2 = negative)
    sentiment_names = ['neutral', 'positive', 'negative']
    predicted_sentiment = sentiment_names[predicted_sentiment_index]
    
    return predicted_sentiment
    

print(predict_sentiment("I loved the new Guardians of the Galaxy movie. It was so well-made and touching!"))
print(predict_sentiment("I hated that movie. Gunn is a talentless hack"))

## **Challenge Exercise 1**
Try modifying the model used here to improve performance. Experiment with a larger vocabulary in the `TextVectorization` layer, a different number of hidden layers, or hidden layers with a different number of nodes.

In [None]:
# TODO: Build and train a modified model


## **Challenge Exercise 2**
Download [this dataset](https://www.kaggle.com/datasets/azimulh/tweets-data-for-authorship-attribution-modelling). Create and train a model for predicting the author of a tweet. This is a similar problem to sentiment analysis, except we have more than 3 possible labels.

In [None]:
# TODO: Build and train a model for authorship prediction


## **Conclusion**
In this lesson, we learned what it looks like to create, train, and evaluate a machine learning model in Python for language processing tasks. We covered:
- Creating vector representations of text using `TextVectorization`
- Building a model with hidden layers
- Training a model on a training dataset
- Evaluating a model with a testing dataset

Machine learning can be a powerful tool for language applications, and it can be applied to a huge range of tasks. Regardless of the task, the basic techniques shown here will be used over and over. 

At this point, you've learned all of the skills necessary to begin building usable language technology. Next, take a look at the projects!