# CS610 Assignment 1 [2023]

***Requirement on Submission***
1.	Submit ONLY your Jupyter Notebook and `requirements.txt`. 
    - Do not submit the given dataset.
    - Do not zip your Jupyter Notebook. 
    - Your answers to all questions should be in your Jupyter Notebook.


2.	Organize your code so that it clearly shows your intention and logic. To do so, you can:
    - Add comments in your code, or
    - Add a Markdown cell explaining what the following code cell does.
    - You may mix the two approaches throughout your Jupyter Notebook; there is no need to use one consistently.


3.	Make sure your code can run without any adjustment on the grader’s machine. 
    - Put all necessary packages in `requirements.txt`.
    - If you run your code on Google Colab and use a cloud drive, adjust your file paths before submission.
    - Note: the dataset is expected to be unzipped and placed in the same directory as your Jupyter Notebook.
    - Note: you can assume the grader’s machine has the packages used in your code installed.


4.	You may add additional cells if needed. Ensure your responses to each question are in the correct location; answers placed under a different question may be disregarded. Tabulate your numbers when a table is provided for you to fill in.

5.	Limit your lines of code in a single code cell: no more than 50 lines, including comments.

***Failure to meet the requirements above will result in a deduction from your grade.***

## Introduction

This assignment uses two datasets: `yelp.csv` and `census_income.csv`.

**Description of the data:**

`yelp.csv`
- Each observation (row) in this dataset is a review of a particular business by a particular user.
- The **stars** column is the number of stars (1 through 5) assigned by the reviewer to the business. (Higher stars are better.) In other words, it is the rating of the business by the person who wrote the review.
- The **text** column is the text of the review.
- **Goal:** Predict the star rating of a review using **only** the review text.

`census_income.csv`
- The income dataset was extracted from the 1994 U.S. Census database. The census is a wide-ranging survey conducted once a decade across the entire country. Its purpose is to gather information about the general population in order to present a full and reliable picture of its housing conditions and its demographic, social, and economic characteristics. The information collected includes age, gender, country of origin, marital status, housing conditions, education, employment, etc.
- **Goal:** Predict whether a person makes over 50K a year based on their demographic attributes.

Use **random_state = 2023** wherever applicable.

## Task 1: Data Exploration (6 points)
Explore the yelp dataset and generate the word cloud for reviews with stars == 1 and stars == 5, respectively. Check the [word_cloud](https://github.com/amueller/word_cloud) library. (4 points)

An example word cloud is shown below. 

![word cloud](https://static.commonlounge.com/fp/600w/FxEgN5woHmXOJOLtm7oGGenV81520493685_kc)

What do you observe? Does the generated word cloud match your expectation? Elaborate. (2 points)

## Task 2: Linear Regression (41 points)

In this task, you are to build a linear regression model to predict the stars based solely on the text feature. Follow the steps below.
1. Split the data into train (80%) and test (20%) using random_state = 2023
2. Use [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to generate a vector representation of the text. Use ngram_range = (1, 2), min_df=0.01 and stop_words = 'english'
3. Build a linear regression model using the default parameters. 


Explain how CountVectorizer works based on the documentation from scikit-learn.org. (2 points)

Explain what ngram_range and min_df mean. (2 points)

What is the RMSE score on the train and test dataset, respectively? What do you observe? (3 points)
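
For reference, RMSE is the square root of the mean squared difference between the true and predicted values. A minimal sketch with NumPy follows; the helper name `rmse` and the toy numbers are illustrative assumptions.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Hypothetical star ratings and predictions
example = rmse([5, 1, 3], [4, 1, 5])
print(example)
```

The same helper can be applied to the train and test predictions separately so the two scores can be compared.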

Based on the same train/test split, 
- use [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) to generate a vector representation of the text. Use ngram_range = (1, 2), min_df=0.01 and stop_words = 'english'.  (1 point)
- Build a linear regression model using default parameters and report RMSE score on the train and test dataset. Do you get better performances using TfidfVectorizer instead of CountVectorizer? (3 points)

Explain how TfidfVectorizer works based on the documentation from scikit-learn.org. (2 points)

Based on the TfidfVectorizer, list the 10 most important features based on the magnitude of the coefficients. Explain whether these attributes make sense. (3 points) 

Hint 
1. Check the documentation of linear regression on how to get the model coefficients. 
2. It's easier to generate a dataframe with feature names and their corresponding coefficients.

Another way of converting a text into a vector is word embedding (word vectors), an NLP technique that maps words or phrases from a vocabulary to continuous vectors of fixed dimension in a latent space. You may use a pre-trained word embedding model. An example is shown below.

In [9]:
from gensim.models import KeyedVectors

# Load the pre-trained Word2Vec model
model_path = "GoogleNews-vectors-negative300.bin.gz"
word2vec_model = KeyedVectors.load_word2vec_format(model_path, binary=True)

def preprocess(text):
    # A simple text preprocessing function that tokenizes words, removes punctuation, and converts to lowercase
    words = text.lower().split()
    return [word.strip('.,!?()"') for word in words]


def get_word_vector(word, model):
    try:
        return model[word]
    except KeyError:
        return None

In [None]:
document = "The quick brown fox jumps over the lazy dog."

# This gives you the word embedding of each word in the document
# (None for any word that is missing from the model's vocabulary)
embeddings = [get_word_vector(word, word2vec_model) for word in preprocess(document)]

Now that you have the word vectors for each word in the document, you need to combine them to create a single vector representing the entire document. There are several strategies for doing this:
- a. Average: Compute the element-wise average of the word vectors. 
- b. Weighted Average: Compute a weighted average of the word vectors, where the weights are based on factors such as term frequency, inverse document frequency, or word importance in the document.
- c. Max Pooling: Take the maximum value for each dimension across all word vectors. This approach can help retain some information about the most important words in the document.
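
The three strategies above can be sketched with NumPy on a small made-up example. The 3-dimensional vectors and the weights are hypothetical; real Word2Vec vectors have 300 dimensions, and words without a vector would need to be skipped first.

```python
import numpy as np

# Hypothetical word vectors for a 3-word document (one row per word)
word_vectors = np.array([
    [1.0, 0.0, 2.0],
    [3.0, 4.0, 0.0],
    [2.0, 2.0, 1.0],
])
# Hypothetical per-word weights, e.g. derived from TF-IDF scores
weights = np.array([0.5, 0.25, 0.25])

# a. Average: element-wise mean over the words
average = word_vectors.mean(axis=0)

# b. Weighted average: np.average normalizes by the sum of the weights
weighted_average = np.average(word_vectors, axis=0, weights=weights)

# c. Max pooling: maximum of each dimension across the words
max_pooled = word_vectors.max(axis=0)

print(average, weighted_average, max_pooled)
```

Each strategy returns one vector with the same dimension as the word vectors, so documents of different lengths map to fixed-size inputs for the regression model.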

Your task is to:
- Implement all these approaches (9 points)
- Train a linear regression model for each of these and report the RMSE on both train and test datasets.  (3 points)

Note: For approach c, use TfidfVectorizer.

Using TfidfVectorizer, build 4 Lasso regression models with alpha in [1e-7, 1e-6, 1e-5, 1e-4] and create a table with the schema below. Think about what model complexity means here. Comment on your observations. (8 points)

| alpha | Train RMSE | Model Complexity | Test RMSE |
|-------|------------|------------------|-----------|
| 1e-7  |            |                  |           |
| 1e-6  |            |                  |           |
| 1e-5  |            |                  |           |
| 1e-4  |            |                  |           |

Build 5 Ridge regression models with alpha in [0.01, 0.1, 1, 10, 100] and create a table with the schema below. Think about what model complexity means here. Comment on your observations. (5 points)

| alpha | Train RMSE | Model Complexity | Test RMSE |
|-------|------------|------------------|-----------|
| 0.01  |            |                  |           |
| 0.1   |            |                  |           |
| 1     |            |                  |           |
| 10    |            |                  |           |
| 100   |            |                  |           |

## Task 3: Naive Bayes Classifier (13 points)


In this task, you are to build a Naive Bayes model to predict whether a review is 5-star or 1-star based on the text feature. Follow the steps below.
1. Create a new DataFrame that contains only the 5-star and 1-star reviews. 
2. Split the data into train (80%) and test (20%) using random_state = 2022 
3. Use [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)  to generate a vector representation of the text. Use ngram_range = (1, 2) and min_df=0.01
4. Use the default parameters for the [MultinomialNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) model.


Generate the confusion matrices on both train and test sets. What do you observe? (3 points)

Calculate which 10 tokens are the most predictive of **5-star reviews**, and which 10 tokens are the most predictive of **1-star reviews**. (6 points)

Hint
1. Naive Bayes automatically counts the number of times each token appears in each class, as well as the number of observations in each class. You can access these counts via the `feature_count_` and `class_count_` attributes of the Naive Bayes model object.
2. Define a metric to indicate the predictiveness of a token for 5-star or 1-star reviews.

Comment on the top tokens of 5-star and 1-star reviews. Do these tokens make sense? (4 points)

## Task 4: Logistic Regression (31 points)

In this task, you are to build a logistic regression model to predict whether a person makes over 50K a year given their demographic attributes. Please take note of the following data dictionary.

Categorical Attributes

- workclass: (categorical) Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
    - Individual work category
- education: (categorical) Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
    - Individual's highest education degree
- marital-status: (categorical) Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
    - Individual marital status
- occupation: (categorical) Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
    - Individual's occupation
- relationship: (categorical) Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
    - Individual's relation in a family
- race: (categorical) White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
    - Race of Individual
- sex: (categorical) Female, Male.
- native-country: (categorical) United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
    - Individual's native country


Continuous Attributes

- age: continuous.
    - Age of an individual
- education-num: number of years of education, continuous.
    - Individual's years of education
- capital-gain: continuous.
- capital-loss: continuous.
- hours-per-week: continuous.
    - Individual's working hours per week

To train any machine learning model, you first need to generate numerical representations for all features. A popular approach for categorical features is [one-hot encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html). Explain what one-hot encoding is based on the sklearn documentation. (2 points)
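
As a small illustration, the sketch below applies `OneHotEncoder` to a made-up single-column array; the toy values are an assumption, and the default sparse output is converted with `.toarray()` only for display.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column with two categories
sex = np.array([["Female"], ["Male"], ["Female"]])

encoder = OneHotEncoder()
encoded = encoder.fit_transform(sex).toarray()  # dense copy for display
categories = encoder.categories_[0].tolist()

print(categories)
print(encoded)
```

Each category becomes its own binary column, and every row has exactly one 1 among the columns derived from the same original feature.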

To avoid a very long one-hot vector due to too many unique values of a particular categorical variable, we often combine certain values together. 

- Use the dictionaries below to map the education, marital-status and native-country columns to new columns. (1 point)

In [10]:
education_mapping = {
    'Preschool': 'dropout',
    '10th': 'dropout',
    '11th': 'dropout',
    '12th': 'dropout',
    '1st-4th': 'dropout',
    '5th-6th': 'dropout',
    '7th-8th': 'dropout',
    '9th': 'dropout',
    'HS-grad': 'HighGrad',
    'Some-college': 'CommunityCollege',
    'Assoc-acdm': 'CommunityCollege',
    'Assoc-voc': 'CommunityCollege',
    'Bachelors': 'Bachelors',
    'Masters': 'Masters',
    'Prof-school': 'Masters',
    'Doctorate': 'Doctorate'
}


marital_status_mapping = {
    'Never-married': 'NotMarried',
    'Married-AF-spouse': 'Married',
    'Married-civ-spouse': 'Married',
    'Married-spouse-absent': 'NotMarried',
    'Separated': 'Separated',
    'Divorced': 'Separated',
    'Widowed': 'Widowed'
}

native_country_mapping = {
    'United-States': 'North America',
    'Cuba': 'North America',
    'Jamaica': 'North America',
    'India': 'Asia',
    '?': 'Unknown',
    'Mexico': 'North America',
    'South': 'Unknown',
    'Puerto-Rico': 'North America',
    'Honduras': 'North America',
    'England': 'Europe',
    'Canada': 'North America',
    'Germany': 'Europe',
    'Iran': 'Asia',
    'Philippines': 'Asia',
    'Italy': 'Europe',
    'Poland': 'Europe',
    'Columbia': 'South America',
    'Cambodia': 'Asia',
    'Thailand': 'Asia',
    'Ecuador': 'South America',
    'Laos': 'Asia',
    'Taiwan': 'Asia',
    'Haiti': 'North America',
    'Portugal': 'Europe',
    'Dominican-Republic': 'North America',
    'El-Salvador': 'North America',
    'France': 'Europe',
    'Guatemala': 'North America',
    'China': 'Asia',
    'Japan': 'Asia',
    'Yugoslavia': 'Europe',
    'Peru': 'South America',
    'Outlying-US(Guam-USVI-etc)': 'North America',
    'Scotland': 'Europe',
    'Trinadad&Tobago': 'North America',
    'Greece': 'Europe',
    'Nicaragua': 'North America',
    'Vietnam': 'Asia',
    'Hong': 'Asia',
    'Ireland': 'Europe',
    'Hungary': 'Europe',
    'Holand-Netherlands': 'Europe'
}
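
A minimal sketch of applying such a dictionary with pandas `Series.map` is shown here. The three-entry `toy_mapping`, the toy DataFrame, and the new column name are assumptions for illustration.

```python
import pandas as pd

# Hypothetical subset of a mapping dictionary
toy_mapping = {
    "Separated": "Separated",
    "Divorced": "Separated",
    "Widowed": "Widowed",
}

df = pd.DataFrame({"marital-status": ["Divorced", "Widowed", "Separated"]})

# Series.map looks up each value in the dictionary;
# values that are not keys of the dictionary become NaN
df["marital-status-grouped"] = df["marital-status"].map(toy_mapping)
print(df)
```

Because unmapped values turn into NaN, it can be worth checking the new column for missing values after mapping, for example if the raw strings carry stray whitespace.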

Apply the log transform below to capital-gain and capital-loss. (1 point)
- $x \mapsto \log_{10}(x + 1)$
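
The effect of this transform can be sketched with NumPy on made-up values; the array below is hypothetical and only shows how large amounts are compressed while zero stays at zero.

```python
import numpy as np

# Hypothetical capital-gain values
capital_gain = np.array([0, 9, 99, 99999])

# log10(x + 1): adding 1 keeps zero entries defined and maps them to 0
transformed = np.log10(capital_gain + 1)
print(transformed)
```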

Now apply one-hot encoding to all categorical variables. (4 points)

Split the data into train and test dataset (80% and 20%, respectively) and build a logistic regression using default configuration. (3 points)

Generate the confusion matrices for train and test set. What do you observe? (3 points)

From the test set, take a look at the two false positives with the highest predicted probabilities and the two false negatives with the lowest predicted probabilities. Why do you think the model predicts them as positive/negative? (4 points)

Build five L2 regularized logistic regression models with C in [0.01, 0.1, 1, 10, 100] and create a table with the schema below. Think about what model complexity means here. Comment on your observations. (3 points)

| C  | Train AUC | Model Complexity | Test AUC |
|--------|---------------|------------------|-----------|
| 0.01  |               |                  |           |
| 0.1   |               |                  |           |
| 1    |               |                  |           |
| 10    |               |                  |           |
| 100   |               |                  |           |

Plot the ROC curves for the five models you have built above. Which model has the best performance? (4 points)

An example ROC plot is shown below.

<img src="https://i.stack.imgur.com/SFA9h.png" width="500"/>
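
The points of a ROC curve and its AUC can be computed with `roc_curve` and `roc_auc_score` from `sklearn.metrics`. The sketch below uses made-up labels and scores; in the assignment the scores would be the predicted probabilities of the positive class.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted scores
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# False positive rate and true positive rate at each threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

print(fpr, tpr)
print(auc)
```

Plotting `tpr` against `fpr` for each model on the same axes gives one curve per model, which makes them easy to compare.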

List the 20 most important features based on the magnitude of the coefficients of the best performing model. Do the top three features make sense? (3 points)

Build five L1 regularized logistic regression models with C in [1, 10, 100, 1000, 10000] and create a table with the schema below. Think about what model complexity means here. Comment on your observations. (3 points)

| C  | Train AUC | Model Complexity | Test AUC |
|--------|---------------|------------------|-----------|
| 1    |               |                  |           |
| 10    |               |                  |           |
| 100   |               |                  |           |
| 1000 |               |                  |           |
| 10000  |               |                  |           |