## Task 1. Feature extraction (15 points)
Use the MapReduce model to convert all text data into matrices, and convert the _ratings_ to vectors. These will be used for classification in Task 2. Use TF-IDF to vectorise the text files; see the previous practical classes and lecture materials for TF-IDF.

One step further is to represent each text file (review) as a very long, sparse vector, as follows. Assume `wordlist` is the final list of distinct words contained in all reviews and its length is $D$. Then each review is a vector of length $D$, where each position is associated with a word in `wordlist` and its value is either 0, if the corresponding word is absent from the review, or the word's TF-IDF. For example, if `wordlist = ['word1', 'word2', 'word3', 'word4']` and review 1 contains `word1` and `word4`, then the vector representation of review 1 is `[0.1, 0, 0, 0.4]`, assuming the TF-IDF values of `word1` and `word4` in review 1 are 0.1 and 0.4 respectively. Note that TF is calculated from a single document, while IDF is obtained from all documents in the collection.
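The vector layout described above can be sketched in plain Python. This is only an illustration, not the required Hadoop implementation: the function name `tfidf_vectors` is hypothetical, and it assumes one common TF-IDF variant (TF as the word count divided by the review length, IDF as the natural log of the number of reviews over the number of reviews containing the word). The variant taught in the lectures may differ.

```python
import math
from collections import Counter

def tfidf_vectors(reviews, wordlist):
    """Return one dense TF-IDF vector of length D = len(wordlist) per review."""
    n_docs = len(reviews)
    tokenised = [review.split() for review in reviews]
    # Document frequency: number of reviews that contain each word at least once
    df = Counter(word for tokens in tokenised for word in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        row = []
        for word in wordlist:
            if counts[word] == 0:
                row.append(0.0)  # word absent from this review
            else:
                tf = counts[word] / len(tokens)     # from this review only
                idf = math.log(n_docs / df[word])   # from the whole collection
                row.append(tf * idf)
        vectors.append(row)
    return vectors

# Toy data, made up for illustration
reviews = ["word1 word4 word4", "word2 word3", "word1 word2"]
wordlist = ["word1", "word2", "word3", "word4"]
vectors = tfidf_vectors(reviews, wordlist)
```

Each row of `vectors` has one entry per word in `wordlist`, so stacking the rows gives the $N \times D$ layout. Under this particular IDF definition, a word that occurs in every review also receives the value 0.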

### Requirements: 

1. The MapReduce model is a must. Implement it using Hadoop streaming. All data are available on SCDMS HDFS. We recommend developing on the tiny version of the data to get the code working. You may try your code on the full version, but this is not required. 
2. Generate two matrices, `training_data` and `test_data`, and two vectors, `training_targets` and `test_targets`. `training_data` should have $N$ rows and $D$ columns, with each row corresponding to one review in the training set, where $N$ is the total number of reviews in the training set and $D$ is the total number of distinct words. $N$ and $D$ vary depending on which version of the data you use. `training_targets` should have $N$ elements, each of which is the rating of the corresponding review. `test_data` and `test_targets` are defined similarly. 
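As a rough sketch of how the IDF part could fit the Hadoop streaming model, the mapper below emits each distinct word of a review once, and the reducer sums those emissions into a document frequency per word. The input layout (one review per line, as `review_id<TAB>text`) and the names `map_lines` and `reduce_lines` are assumptions made for illustration; the actual files on HDFS may be organised differently. In a real streaming job each function would read from `sys.stdin` and print its output, and Hadoop would perform the sort between the two steps.

```python
from itertools import groupby

def map_lines(lines):
    """Mapper: emit 'word<TAB>1' once per distinct word in each review."""
    for line in lines:
        # Assumed input layout: review_id<TAB>review text
        _review_id, _, text = line.rstrip("\n").partition("\t")
        for word in sorted(set(text.split())):
            yield f"{word}\t1"

def reduce_lines(lines):
    """Reducer: sum the counts of consecutive lines that share a word.

    Relies on the input being sorted by key, as Hadoop streaming guarantees.
    """
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local simulation of map -> shuffle/sort -> reduce on made-up reviews
sample = ["r1\tgood film good cast", "r2\tbad film"]
shuffled = sorted(map_lines(sample))
document_frequency = list(reduce_lines(shuffled))
```

The same simulation can be run from a shell with `cat input | mapper | sort | reducer`, which is a convenient way to debug on the tiny version before submitting a job.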


## Task 1: submission 
Your work goes here. Add blocks when necessary. Add inline comments in Python code or in markdown blocks. 

In [3]:
# Python code for task 1, including mapper and reducer, saved as task1.py, run by 'python task1.py', output test_data and test_targets
# train folder, test folder and all python files are in the same directory

import os
import sys

def extract_ratings_from_filenames(directory):
    ratings = []
    for file in os.listdir(directory):
        if file.endswith('.txt'):
            rating = int(file.split('_')[1].split('.')[0])
            ratings.append(rating)
    return ratings

def mapper(file):
    with open(file, 'r') as f:
        words = f.read().split()
    return words


def reducer(files):
    unique_words = set()  # To store unique words across all files
    matrix = []

    for file in files:
        words = mapper(file)
        unique_words.update(words)  # Update unique words set
        matrix.append(words)

    return matrix, list(unique_words)

def generate_word_index_map(unique_words):
    word_index_map = {}
    for idx, word in enumerate(unique_words):
        word_index_map[word] = idx
    return word_index_map

def generate_matrix(matrix, word_index_map):
    result_matrix = []
    for words in matrix:
        row = [0] * len(word_index_map)  # Initialize row with zeros
        for word in words:
            if word in word_index_map:
                idx = word_index_map[word]
                row[idx] += 1  # Increment count of word in row
        result_matrix.append(row)
    return result_matrix

def main():
    # Define the directory path (input directory), in this case use the test dataset, change to 'train' if it is for train dataset
    directory_path = "test"

    # Check if the specified directory exists
    if not os.path.isdir(directory_path):
        print("Error: Directory does not exist.")
        sys.exit(1)

    # Extract ratings from file names
    test_targets = extract_ratings_from_filenames(directory_path)

    # Generate list of files in the directory
    files = [os.path.join(directory_path, file) for file in os.listdir(directory_path) if file.endswith('.txt')]

    # Generate matrix and unique words list
    matrix, unique_words = reducer(files)

    # Generate word-index map
    word_index_map = generate_word_index_map(unique_words)

    # Generate matrix with N rows and D columns
    test_data = generate_matrix(matrix, word_index_map)

    # Print the result vector and matrix, this step is to make sure results are valid
    print("Test vector:", test_targets)
    print("Test Matrix:")
    for row in test_data:
        print(row)

if __name__ == "__main__":
    main()

Test vector: [7, 3, 8, 4, 7, 1, 1, 8, 2, 8, 10, 3]
Test Matrix:
[0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 3, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0,

In [None]:
%%bash
hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.3.6.jar \
  -mapper hadoopmapper1.py \
  -reducer hadoopreducer1.py \
  -input /users/bigdata/imdb/tinyversion \
  -output /hdfs/path/to/outputdirectory \
  -file local/path/to/hadoopmapper1.py \
  -file local/path/to/hadoopreducer1.py

In [5]:
# Python code for task 1, including mapper and reducer, saved as task1_train.py, run by 'python task1_train.py', output train_data and train_targets

import os
import sys

def extract_ratings_from_filenames(directory):
    ratings = []
    for file in os.listdir(directory):
        if file.endswith('.txt'):
            rating = int(file.split('_')[1].split('.')[0])
            ratings.append(rating)
    return ratings

def mapper(file):
    with open(file, 'r') as f:
        words = f.read().split()
    return words

def reducer(files):
    unique_words = set()  # To store unique words across all files
    matrix = []

    for file in files:
        words = mapper(file)
        unique_words.update(words)  # Update unique words set
        matrix.append(words)

    return matrix, list(unique_words)

def generate_word_index_map(unique_words):
    word_index_map = {}
    for idx, word in enumerate(unique_words):
        word_index_map[word] = idx
    return word_index_map

def generate_matrix(matrix, word_index_map):
    result_matrix = []
    for words in matrix:
        row = [0] * len(word_index_map)  # Initialize row with zeros
        for word in words:
            if word in word_index_map:
                idx = word_index_map[word]
                row[idx] += 1  # Increment count of word in row
        result_matrix.append(row)
    return result_matrix

if __name__ == "__main__":
    # Define the directory path (input directory)
    directory_path = "train"

    # Check if the specified directory exists
    if not os.path.isdir(directory_path):
        print("Error: Directory does not exist.")
        sys.exit(1)

    # Extract ratings from file names
    training_targets = extract_ratings_from_filenames(directory_path)

    # Generate list of files in the directory
    files = [os.path.join(directory_path, file) for file in os.listdir(directory_path) if file.endswith('.txt')]

    # Generate matrix and unique words list
    matrix, unique_words = reducer(files)

    # Generate word-index map
    word_index_map = generate_word_index_map(unique_words)

    # Generate matrix with N rows and D columns
    training_data = generate_matrix(matrix, word_index_map)

    # Print the result vector and matrix
    print("Train vector:", training_targets)
    print("Train Matrix:")
    for row in training_data:
        print(row)

Train vector: [7, 10, 3, 8, 7, 7, 2, 8, 4, 7, 1, 3, 1, 1, 8, 1, 2, 8, 10, 3, 4]
Train Matrix:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,

With the above Python source file saved, run the following script on Hadoop. 

In [None]:
%%bash
hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.3.6.jar \
  -mapper 'python hadoopmapreducer2.py map' \
  -reducer 'python hadoopmapreducer2.py reduce' \
  -input /users/bigdata/imdb/tinyversion \
  -output /hdfs/path/to/outputdirectory \
  -file local/path/to/hadoopmapreducer2.py 

<hr style="height:4px;border-width:0;color:gray;background-color:green">

## Task 2. Classification (15 points)
Construct a classification model for review sentiment prediction: given a customer review of a movie (taken from the test set), your program should predict whether it is positive or negative. There is no limit on how many classifiers you use or which specific model you choose. Simply pick one that works for you for this task, either from those covered in lectures and practical classes or from any Python package. A good starting point is the `scikit-learn` (i.e. `sklearn`) package. 

<hr>

## Task 2: submission 
Your work goes here. Add blocks when necessary. Add inline comments in Python code or in markdown blocks. 

Here is my first Python script. 

In [1]:
# Python code for Task 2: data preprocessing, model training and evaluation, saved as task2.py, run by 'python task2.py', outputs prediction accuracy

import os
import sys
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler

from task1 import extract_ratings_from_filenames 
from task1 import reducer 
from task1 import generate_word_index_map 
from task1 import generate_matrix 

def get_test_data():
    directory_path = "test"

    # Extract ratings from file names
    test_targets = extract_ratings_from_filenames(directory_path)

    files = [os.path.join(directory_path, file) for file in os.listdir(directory_path) if file.endswith('.txt')]

    # Generate matrix and unique words list
    matrix, unique_words = reducer(files)

    # Generate word-index map
    word_index_map = generate_word_index_map(unique_words)

    # Generate matrix with N rows and D columns
    test_data = generate_matrix(matrix, word_index_map)

    return test_data, test_targets, matrix, unique_words

def get_train_data():
    directory_path = "train"

    # Extract ratings from file names
    training_targets = extract_ratings_from_filenames(directory_path)

    files = [os.path.join(directory_path, file) for file in os.listdir(directory_path) if file.endswith('.txt')]

    # Generate matrix and unique words list
    matrix, unique_words = reducer(files)

    # Generate word-index map
    word_index_map = generate_word_index_map(unique_words)

    # Generate matrix with N rows and D columns
    training_data = generate_matrix(matrix, word_index_map)

    return training_data, training_targets, matrix, unique_words

def main():
    # Load the training and test sets separately; at this point each has its own vocabulary
    training_data, training_targets, training_matrix, training_unique_words = get_train_data()
    test_data, test_targets, test_matrix, test_unique_words = get_test_data()

    # Create a union of all unique words
    all_unique_words = set(training_unique_words).union(set(test_unique_words))

    # Generate word-index map for combined vocabulary
    word_index_map = generate_word_index_map(all_unique_words)

    # Reconstruct matrices with combined vocabulary
    combined_training_data = generate_matrix(training_matrix, word_index_map)
    combined_test_data = generate_matrix(test_matrix, word_index_map)

    # Convert targets to binary labels: ratings above 5 are positive (1), the rest negative (0)
    training_labels = np.array([1 if rating > 5 else 0 for rating in training_targets])  # Convert to numpy array
    test_labels = np.array([1 if rating > 5 else 0 for rating in test_targets])

    # Apply Min-Max normalization
    scaler = MinMaxScaler()

    # Fit and transform training data
    normalized_train = scaler.fit_transform(combined_training_data)

    # Transform test data using fitted scaler
    normalized_test = scaler.transform(combined_test_data)

    # Define the parameter grid to search
    param_grid = {'max_depth': [5, 10, 15], 'n_estimators': [50, 100, 200]}

    # Create a Random Forest Classifier
    rfc = RandomForestClassifier()

    # Perform Grid Search Cross Validation
    grid_search = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=3)
    grid_search.fit(normalized_train, training_labels)

    # Get the best parameters
    best_params = grid_search.best_params_

    # Apply the best model on test data
    best_model = RandomForestClassifier(**best_params)
    best_model.fit(normalized_train, training_labels)
    predictions = best_model.predict(normalized_test)

    # Calculate prediction accuracy
    accuracy = accuracy_score(test_labels, predictions)

    # Print predictions and accuracy
    print(f"Prediction Accuracy: {accuracy * 100:.2f}%")

if __name__ == "__main__":
    main()



Prediction Accuracy: 91.67%
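
To put the accuracy above in context, it can help to compare it with a majority-class baseline, which always predicts the most frequent label in the training set. The sketch below is an addition for illustration: it copies the ratings from the printed "Train vector" and "Test vector" outputs of Task 1 and reuses the `rating > 5` threshold from `task2.py`. The helper name `to_labels` is hypothetical.

```python
from collections import Counter

# Ratings copied from the "Train vector" and "Test vector" outputs of Task 1
train_ratings = [7, 10, 3, 8, 7, 7, 2, 8, 4, 7, 1, 3, 1, 1, 8, 1, 2, 8, 10, 3, 4]
test_ratings = [7, 3, 8, 4, 7, 1, 1, 8, 2, 8, 10, 3]

def to_labels(ratings, threshold=5):
    """Map ratings to binary sentiment: 1 if rating > threshold, else 0."""
    return [1 if rating > threshold else 0 for rating in ratings]

train_labels = to_labels(train_ratings)
test_labels = to_labels(test_ratings)

# Always predict the most common training label
majority = Counter(train_labels).most_common(1)[0][0]
baseline_accuracy = sum(label == majority for label in test_labels) / len(test_labels)
print(f"Majority class: {majority}, baseline accuracy: {baseline_accuracy * 100:.2f}%")
# prints "Majority class: 0, baseline accuracy: 50.00%"
```

With only 12 test reviews, each review accounts for about 8.3 percentage points, so 91.67% corresponds to 11 of 12 correct predictions. A result on a test set this small is best treated as indicative rather than precise.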


<hr style="height:4px;border-width:0;color:gray;background-color:blue">

<hr>