## This homework assesses your ability to build and deploy your applications.


`start date: Nov 22nd 11:59 PM` <br>
`due date: Dec 5th 11:59 PM`

### Make sure you submit your work to BrightSpace.

`Total credits:  63/53`

You're welcome to share your thoughts about the homework and the course materials here: https://forms.gle/Kd9AoUZwkMiF8Vx5A

# P1: File path related questions

In the following structure, `project`, `data`, and `scripts` are folder names. `my_notebook.ipynb` is the Jupyter Notebook you used to create these folders.

```text
project/
    data/
        file_name.csv
    scripts/
        script.py
my_notebook.ipynb
```

`P1.1 3 pts` Suppose I'm currently at the same level as the `project` folder, meaning I am outside the `project` folder but within the same parent directory. I want to open `file_name.csv` using the following code written in `my_notebook.ipynb`:

```python
df = pd.read_csv('file_name.csv')

```
Will this code work? If not, how should I revise it? 

No. A bare file name is resolved against the current working directory, which here is the parent of `project`, while the file lives in `project/data/`. Use the path relative to the notebook's location instead:

`df = pd.read_csv('project/data/file_name.csv')`
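
To make the path resolution concrete, here is a minimal sketch that rebuilds the folder layout in a temporary directory (a made-up location, used only for illustration) and checks which of the two paths exists when the working directory is the parent of `project`. It uses only the standard library, so `pandas` is not needed to run it.

```python
import os
import tempfile
from pathlib import Path

# Build a throwaway copy of the folder layout (hypothetical temp location)
root = Path(tempfile.mkdtemp()).resolve()
data_dir = root / "project" / "data"
data_dir.mkdir(parents=True)
(data_dir / "file_name.csv").write_text("a,b\n1,2\n")

original_cwd = os.getcwd()
os.chdir(root)  # pretend the notebook lives next to `project`
try:
    # Relative paths are resolved against the current working directory
    bare_name_found = Path("file_name.csv").exists()
    full_path_found = Path("project/data/file_name.csv").exists()
finally:
    os.chdir(original_cwd)  # always restore the working directory

print(bare_name_found, full_path_found)
```

Only the path that spells out `project/data/` is found, which is why `pd.read_csv('file_name.csv')` fails from the notebook's location.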

`P1.2 3 pts` Suppose you are now in the `data` folder, and you want to rewrite your `script.py` file using the `%%writefile` method while reflecting all your actions (e.g., `cd` commands, file path changes) within the `my_notebook.ipynb` file. Can you demonstrate how you would accomplish this?

In [None]:
# Step 1: Verify the current working directory
import os
print(os.getcwd())  # Should end with `data`

# Step 2: Move from `data` to the sibling `scripts` folder
os.chdir('../scripts/')
print(os.getcwd())  # Should now end with `scripts`

# Step 3 is the next cell: %%writefile is a cell magic, so it must be the
# first line of its own cell, and everything below it goes into the file

In [None]:
%%writefile script.py
def greet(name):
    return f"Hello, {name}!"

if __name__ == "__main__":
    print(greet("World"))

In [None]:
# Step 4: Return to the `data` folder if needed
os.chdir('../data/')
print(os.getcwd())  # Should end with `data` again

`P1.3 3 pts` Suppose you are outside of the `project` folder but still within the same parent directory, and you want to rewrite your `script.py` file using the `%%writefile` method while reflecting all your actions (e.g., `cd` commands, file path changes) within the `my_notebook.ipynb` file. Can you demonstrate how you would accomplish this? Assume you did everything within your `my_notebook.ipynb`.

In [None]:
# Step 1: Verify the current working directory
import os
print(os.getcwd())  # Should be the folder that contains `project`

# Step 2: Move into the `scripts` folder inside `project`
os.chdir('project/scripts')
print(os.getcwd())  # Should now end with `project/scripts`

# Step 3 is the next cell: %%writefile is a cell magic, so it must be the
# first line of its own cell, and everything below it goes into the file

In [None]:
%%writefile script.py
def greet(name):
    return f"Hello, {name}!"

if __name__ == "__main__":
    print(greet("World"))

In [None]:
# Step 4: Return to the original directory (outside `project`)
os.chdir('../../')
print(os.getcwd())  # Should be the starting location again
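
The change-directory, write, change-back pattern above can also be wrapped so the working directory is always restored, even if the write fails. The sketch below is one possible way to do that in plain Python; `working_directory` is a hypothetical helper (not part of the assignment), the folder layout is rebuilt in a temporary directory, and `Path.write_text` stands in for `%%writefile`, which only exists inside IPython.

```python
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def working_directory(path):
    """Temporarily switch the working directory (hypothetical helper)."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(previous)  # runs even if the body raises

# Rebuild the layout in a throwaway location (assumption for the demo)
root = Path(tempfile.mkdtemp()).resolve()
scripts_dir = root / "project" / "scripts"
scripts_dir.mkdir(parents=True)

before = os.getcwd()
with working_directory(scripts_dir):
    # Stand-in for `%%writefile script.py`
    Path("script.py").write_text('def greet(name):\n    return f"Hello, {name}!"\n')
after = os.getcwd()

print(before == after, (scripts_dir / "script.py").exists())
```

Inside a notebook, passing the full relative path to the magic (`%%writefile project/scripts/script.py`) avoids changing directories at all.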

`P1.4 3 pts` Suppose you are outside of the `project` folder but still within the same parent directory, and you want to create a sub-folder named `test_data` under the `data` folder. Can you demonstrate how you would accomplish this? Assume you did everything within your `my_notebook.ipynb`.

In [None]:
# Step 1: Verify the current working directory
import os
print(os.getcwd())  

# Step 2: Define the path for the `test_data` folder
test_data_path = os.path.join('project', 'data', 'test_data')

# Step 3: Create the `test_data` folder
os.makedirs(test_data_path, exist_ok=True)  

# Step 4: Confirm the folder was created
print(f"Folder 'test_data' created at: {os.path.abspath(test_data_path)}")
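
For comparison, the same folder creation can be written with `pathlib`. This is a sketch under the assumption that the layout is rebuilt in a temporary directory; in the notebook, `root` would simply be the current folder.

```python
import tempfile
from pathlib import Path

# Throwaway stand-in for the folder that contains `project`
root = Path(tempfile.mkdtemp())
(root / "project" / "data").mkdir(parents=True)

test_data = root / "project" / "data" / "test_data"
test_data.mkdir(parents=True, exist_ok=True)
test_data.mkdir(parents=True, exist_ok=True)  # repeating the call is harmless

print(test_data.is_dir())
```

`parents=True` creates any missing intermediate folders and `exist_ok=True` suppresses the error when the folder is already there, matching `os.makedirs(..., exist_ok=True)`.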

`P1.5 5 pts` Suppose you are outside of the `project` folder but still within the same parent directory, and you want to import your `script.py` file as a module. Your `script.py` file has the following contents:

```python
df = pd.read_csv('file_name.csv')
```

You have used the following code to do the import within your `my_notebook.ipynb` file:

```python
import script
```

What are the issues with the `script.py` file itself and with the way it is imported? Explain below and fix them yourself.


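- Issues with `script.py`:
  - It uses `pd` without `import pandas as pd`, so importing the module raises a `NameError`.
  - It calls `pd.read_csv` at import time, which is bad practice: merely importing the module triggers file I/O.
  - It assumes the CSV file is in the current working directory, so it fails when the module is imported from any other directory.
- Issue with the import:
  - `import script` fails with `ModuleNotFoundError` because `project/scripts` is not on `sys.path`, so Python cannot find the module.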

In [None]:
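# Updated script.py

import os
import pandas as pd

# Resolve `data` relative to this file, so the path does not depend on the
# caller's working directory (the notebook runs from outside `project`)
DATA_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), '..', 'data')

def load_csv(file_name):
    """Load a CSV file from the project's 'data' folder."""
    file_path = os.path.join(DATA_DIR, file_name)
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"'{file_name}' not found at '{file_path}'")
    return pd.read_csv(file_path)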


In [None]:
# Import and Usage in my_notebook.ipynb

import os
import sys

# Add the scripts folder to sys.path
sys.path.append(os.path.join('project', 'scripts'))

# Import and use the script module
import script
df = script.load_csv('file_name.csv')
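
The two fixes (making the module findable through `sys.path` and resolving the data path from the module's own location) can be tried end to end without touching a real project. The sketch below is illustrative only: it rebuilds the layout in a temporary directory, names the module `demo_script` (a hypothetical name, chosen to avoid clashing with an existing `script` module), and reads the CSV with the standard-library `csv` module instead of `pandas` so that it runs without extra dependencies.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Rebuild the project layout in a throwaway location (assumption for the demo)
root = Path(tempfile.mkdtemp()).resolve()
(root / "project" / "data").mkdir(parents=True)
(root / "project" / "scripts").mkdir(parents=True)
(root / "project" / "data" / "file_name.csv").write_text("a,b\n1,2\n")

# The module locates `data` from its own file path, not from the working directory
(root / "project" / "scripts" / "demo_script.py").write_text(
    "import csv\n"
    "from pathlib import Path\n"
    "\n"
    "DATA_DIR = Path(__file__).resolve().parent.parent / 'data'\n"
    "\n"
    "def load_rows(file_name):\n"
    "    with open(DATA_DIR / file_name, newline='') as handle:\n"
    "        return list(csv.reader(handle))\n"
)

# An absolute entry keeps the import working even if the working directory changes
scripts_path = str(root / "project" / "scripts")
sys.path.append(scripts_path)
try:
    demo_script = importlib.import_module("demo_script")
    rows = demo_script.load_rows("file_name.csv")
finally:
    sys.path.remove(scripts_path)

print(rows)
```

Note that the working directory is never changed here: the load succeeds because the module builds the path from `__file__`.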


`P1.6 6 pts` Suppose you are outside of the `project` folder but still within the same parent directory, and you want to first import the class `clean_data` from `script.py`, and then create an instance of the class. Your `script.py` file has the following contents:

```python
class clean_data:

    def __init__(self):
        df = pd.read_csv(`file_name.csv`)
```

You have used the following code to import the class and create an instance:

```python
import script

clean_data = clean_data()

```

What are the issues with the `script.py` file itself (1 issue), with the way it is imported (2 issues), and with the way the instance is created (1 issue)? Explain below and fix them yourself.

- Issue with `script.py`:
  - Hardcoded CSV file path in the constructor: `__init__` always reads `file_name.csv` from the working directory. This fails if the file isn't in the working directory when the code runs, or if a different file is needed.
- Issues with the import:
  - The class is not accessed correctly: after `import script`, the class must be referenced as `script.clean_data`, or imported with `from script import clean_data`.
  - Module location: Python doesn't know where `project/scripts/script.py` is, so the folder must be added to `sys.path`.
- Issue with instance creation:
  - Shadowing the class name: `clean_data = clean_data()` rebinds `clean_data` to an instance, so the class can no longer be reached through that name.

In [None]:
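# Updated script.py

import os
import pandas as pd

# Resolve `data` relative to this file, so the path does not depend on the
# caller's working directory (the notebook runs from outside `project`)
DATA_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), '..', 'data')

class clean_data:
    def __init__(self, file_name):
        file_path = os.path.join(DATA_DIR, file_name)
        if not os.path.exists(file_path):
            raise FileNotFoundError(f"'{file_name}' not found at '{file_path}'")
        self.df = pd.read_csv(file_path)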

In [None]:
# Import and Usage in my_notebook.ipynb:

import os
import sys

# Step 1: Add the scripts folder to sys.path
sys.path.append(os.path.join('project', 'scripts'))

# Step 2: Import the clean_data class
from script import clean_data

# Step 3: Create an instance of the class
cleaner = clean_data('file_name.csv')
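
The shadowing problem is easy to reproduce in isolation. The sketch below uses a stripped-down stand-in for `clean_data` (it only stores the file name, no CSV is read) to show what happens when the instance reuses the class name.

```python
class clean_data:
    """Stripped-down stand-in for the real class; it only stores the name."""
    def __init__(self, file_name):
        self.file_name = file_name

# A distinct variable name keeps the class usable
cleaner = clean_data("file_name.csv")
another = clean_data("other.csv")

# Reusing the class name rebinds it to an instance
original_class = clean_data
clean_data = clean_data("file_name.csv")
try:
    clean_data("other.csv")  # the name now refers to an instance, which is not callable
    second_call_failed = False
except TypeError:
    second_call_failed = True

clean_data = original_class  # restore the name for any later code
print(second_call_failed)
```

Giving the instance its own name, such as `cleaner`, avoids the problem entirely.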

# P2. `30 pts` Coding challenges (Individual version)

For this question, you will build a movie recommendation system using K-Nearest Neighbors (KNN) and create a webpage interface with Streamlit. The Streamlit webpage should allow the user to input a movie name they have watched before, and based on that input, the system will recommend 4 similar movies. Additionally, you will deploy your Streamlit app on AWS EC2.

For this question, I will place fewer restrictions on the choice of data and features, allowing you to mimic real-world decision-making as data professionals. In previous assignments, I guided you step-by-step to teach you how to correctly code each small part. Now that you’ve gained those foundational skills, this assignment will focus more on exercising your ability to design and solve problems independently.

You are free to use any dataset you like for this assignment. The movie dataset from HW8 is sufficient, but you are welcome to explore and use other datasets if you prefer.

Your final submission should include:

- A screenshot of the Streamlit webpage displaying your app and its recommendations.
- The source code used to create the application.

`Bonus 10 pts` Use a pre-trained LLM to add a short description of the movie provided by the user.


In [8]:
%%writefile movies.py

from sklearn.preprocessing import MultiLabelBinarizer, MinMaxScaler
import pandas as pd
import numpy as np
import joblib

# Load the dataset
movies = pd.read_csv('movies.csv')

# Handle missing values
movies['genres'] = movies['genres'].fillna('')
movies['vote_average'] = movies['vote_average'].fillna(0)

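# One-hot encode genres (split() with no argument returns [] for an empty
# string, so movies without genres do not create a spurious '' column)
movies['genres'] = movies['genres'].apply(lambda x: x.split())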
mlb = MultiLabelBinarizer()
genres_encoded = pd.DataFrame(mlb.fit_transform(movies['genres']), columns=mlb.classes_)

# Normalize the `vote_average` column
scaler = MinMaxScaler()
movies['vote_average_scaled'] = scaler.fit_transform(movies[['vote_average']])

# Combine the features
feature_matrix = pd.concat([genres_encoded, movies['vote_average_scaled']], axis=1)
feature_matrix = feature_matrix.fillna(0)

# Save the feature matrix for the app
joblib.dump(feature_matrix, 'movie_features.pkl')

from sklearn.neighbors import NearestNeighbors

# Convert feature matrix to a NumPy array
feature_matrix_np = feature_matrix.to_numpy()

# Train the KNN model
knn = NearestNeighbors(metric='cosine', algorithm='brute')
knn.fit(feature_matrix_np)

# Save the KNN model
joblib.dump(knn, 'knn_model.pkl')

import streamlit as st

# Load data and models (joblib and pandas are already imported above)
movies = pd.read_csv('movies.csv')
feature_matrix = joblib.load('movie_features.pkl')
knn = joblib.load('knn_model.pkl')

# Streamlit interface
st.title("Movie Recommendation System")
movie_name = st.text_input("Enter a movie name you've watched:")

if movie_name:
    try:
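        # Find the first title containing the input (case-insensitive);
        # regex=False treats characters such as "(" or "+" literally
        matches = movies['title'].str.contains(movie_name, case=False, na=False, regex=False)
        movie_index = movies[matches].index[0]
        query = feature_matrix.iloc[[movie_index]].to_numpy()
        distances, indices = knn.kneighbors(query, n_neighbors=5)

        # Drop the input movie itself and keep 4 recommendations; with tied
        # distances it is not guaranteed to be the first neighbor returned
        neighbor_ids = [i for i in indices[0] if i != movie_index][:4]
        recommendations = movies.iloc[neighbor_ids]['title']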
        st.write("Movies you might like:")
        for rec in recommendations:
            st.write(f"- {rec}")
    except IndexError:
        st.error("Movie not found. Please try another.")


Writing movies.py
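
To show what the brute-force cosine search is doing conceptually, here is a pure-Python sketch on a toy feature matrix. It is not scikit-learn's implementation; `cosine_distance`, `recommend`, and the six feature rows are made-up illustrations, and the handling of all-zero rows is an assumption of this sketch.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # assumption: treat all-zero rows as maximally distant
    return 1.0 - dot / (norm_u * norm_v)

def recommend(features, query_index, k=4):
    """Return the indices of the k rows closest to the query row, excluding it."""
    scored = [
        (cosine_distance(features[query_index], row), i)
        for i, row in enumerate(features)
        if i != query_index  # never recommend the input movie itself
    ]
    scored.sort()  # smallest distance first
    return [i for _, i in scored[:k]]

# Toy rows: [Action, Comedy, Drama, scaled rating] (made-up values)
toy_features = [
    [1, 0, 0, 0.9],
    [1, 0, 0, 0.8],
    [0, 1, 0, 0.7],
    [1, 0, 1, 0.6],
    [0, 0, 1, 0.5],
    [0, 1, 0, 0.9],
]

recommended = recommend(toy_features, query_index=0, k=4)
print(recommended)
```

Excluding the query by index, rather than assuming it is always the first neighbor, keeps the result at exactly four distinct recommendations even when several movies share identical features.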
