<a href="https://colab.research.google.com/github/harperluthy/DS1001-LABS3-Projects/blob/main/LABS_09_Analytics_(1).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LABS-9: Analytics Project

In this notebook you will run and edit the code to perform some data cleaning and run a basic kNN model.

> **What is kNN?**\
k-Nearest Neighbors (kNN) is a machine learning algorithm used for classification tasks. At its core, kNN works by measuring the "distance" between data points. When you want to predict the category of a new data point, kNN looks at the 'k' closest points in your dataset (its "neighbors") and assigns the most common category among those neighbors to the new point.

> **How does kNN measure distance?**\
kNN relies on distance metrics - commonly Euclidean distance (the straight-line distance between two points) - to find which data points are most similar. The algorithm compares all features (columns) in your dataset, so it's important that these features are numeric or converted to a format where distances can be calculated.

> **How will we use kNN here?**\
In this notebook, we'll use kNN for classification: predicting whether a movie will receive a "high" or "low" score based on its features (like genre, country, budget, and more). By carefully cleaning and formatting our data, we ensure that kNN can measure distances and make meaningful predictions.

**Data**\
This dataset comes from IMDB and can be accessed on [Kaggle](https://www.kaggle.com/datasets/ashpalsingh1525/imdb-movies-dataset).

## Set up environment

In [1]:
## import packages

import pandas as pd #data ingestion & cleaning
import numpy as np #numbers

# modeling
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import kagglehub

In [2]:
# Download latest version
path = kagglehub.dataset_download("ashpalsingh1525/imdb-movies-dataset")

print("Path to dataset files:", path)

Using Colab cache for faster access to the 'imdb-movies-dataset' dataset.
Path to dataset files: /kaggle/input/imdb-movies-dataset


In [6]:
# Read in data
data = pd.read_csv('/kaggle/input/imdb-movies-dataset/imdb_movies.csv')

# view data
display(data.head())

Unnamed: 0,names,date_x,score,genre,overview,crew,orig_title,status,orig_lang,budget_x,revenue,country
0,Creed III,03/02/2023,73.0,"Drama, Action","After dominating the boxing world, Adonis Cree...","Michael B. Jordan, Adonis Creed, Tessa Thompso...",Creed III,Released,English,75000000.0,271616700.0,AU
1,Avatar: The Way of Water,12/15/2022,78.0,"Science Fiction, Adventure, Action",Set more than a decade after the events of the...,"Sam Worthington, Jake Sully, Zoe Saldaña, Neyt...",Avatar: The Way of Water,Released,English,460000000.0,2316795000.0,AU
2,The Super Mario Bros. Movie,04/05/2023,76.0,"Animation, Adventure, Family, Fantasy, Comedy","While working underground to fix a water main,...","Chris Pratt, Mario (voice), Anya Taylor-Joy, P...",The Super Mario Bros. Movie,Released,English,100000000.0,724459000.0,AU
3,Mummies,01/05/2023,70.0,"Animation, Comedy, Family, Adventure, Fantasy","Through a series of unfortunate events, three ...","Óscar Barberán, Thut (voice), Ana Esther Albor...",Momias,Released,"Spanish, Castilian",12300000.0,34200000.0,AU
4,Supercell,03/17/2023,61.0,Action,Good-hearted teenager William always lived in ...,"Skeet Ulrich, Roy Cameron, Anne Heche, Dr Quin...",Supercell,Released,English,77000000.0,340942000.0,US


In [5]:
import os

# List files in the directory
directory_path = '/kaggle/input/imdb-movies-dataset'
files = os.listdir(directory_path)
print(files)

['imdb_movies.csv']


## Data Cleaning & Model Prep

Before building a machine learning model, it is essential to clean and format the data. Raw data often contains missing values, inconsistent formats, or irrelevant information that can negatively impact or break a model.\
Many algorithms, including kNN, require numeric input or specificly formatted categorical data. By cleaning the data (removing or imputing missing values, converting strings to categorical variables, and creating dummy variables), we ensure that our dataset is structured in a way that the model can interpret and learn from effectively.

Proper data preparation leads to more accurate, reliable, and interpretable results.

There are many decisions that get made throughout this process and there is often no "right" answer - so documentating why you do things as you clean data is **key**.

### Missing Values

We saw in our design lab that some of our columns are missing values. Many models can not tolerate missing data (they will break the model), so we have to deal with these before passing the data through to our model.

We can use the [`.info()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html#pandas.DataFrame.info) method to see what columns are missing data. Run this below (look back at LABS-06 if you don't remember how).

In [7]:
# ADD THE CODE TO RUN .INFO() ON THE DATA HERE



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10178 entries, 0 to 10177
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   names       10178 non-null  object 
 1   date_x      10178 non-null  object 
 2   score       10178 non-null  float64
 3   genre       10093 non-null  object 
 4   overview    10178 non-null  object 
 5   crew        10122 non-null  object 
 6   orig_title  10178 non-null  object 
 7   status      10178 non-null  object 
 8   orig_lang   10178 non-null  object 
 9   budget_x    10178 non-null  float64
 10  revenue     10178 non-null  float64
 11  country     10178 non-null  object 
dtypes: float64(3), object(9)
memory usage: 954.3+ KB


2 columns are missing data: `genre` and `crew`.\
Since we have a large data set for kNN, we can drop the relatively few rows that are missing data using .dropna()

In [8]:
## make a new df to make changes to
model_data = data.copy()

In [9]:
# drop rows containing NaN (missing) values
model_data.dropna(inplace=True)

Now that we've dropped rows with missing values, our dataset is free of NaNs.
>The output of the cell below (`model_data.info()`) confirms that all columns are complete and can be used to answer **question 1** regarding how much data was removed.

In [10]:
model_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10052 entries, 0 to 10177
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   names       10052 non-null  object 
 1   date_x      10052 non-null  object 
 2   score       10052 non-null  float64
 3   genre       10052 non-null  object 
 4   overview    10052 non-null  object 
 5   crew        10052 non-null  object 
 6   orig_title  10052 non-null  object 
 7   status      10052 non-null  object 
 8   orig_lang   10052 non-null  object 
 9   budget_x    10052 non-null  float64
 10  revenue     10052 non-null  float64
 11  country     10052 non-null  object 
dtypes: float64(3), object(9)
memory usage: 1020.9+ KB


### Look at Columns

When preparing data for kNN, it's important to look at the values in each column, especially for columns with categories (like country or genre).\
kNN works by measuring the distance between data points to find the ones that are most similar, so if a column has too many different categories, it can make these distance calculations confusing and less useful. If some categories only appear a few times, they don't help much and can make the model less accurate.

To fix this, we can group these less common categories into a single 'other' category. This makes the data simpler and helps kNN focus on the most useful information when measuring distances.

In [11]:
data['status'].value_counts()

Unnamed: 0_level_0,count
status,Unnamed: 1_level_1
Released,10131
Post Production,31
In Production,16


The `status` column has 3 levels (3 distrinct values in the column). This works great for kNN!

Now let's take a look at `country`.

In [12]:
data['country'].value_counts()

Unnamed: 0_level_0,count
country,Unnamed: 1_level_1
AU,4885
US,2750
JP,538
KR,361
FR,222
GB,174
ES,153
HK,125
IT,123
MX,105


`country` has far too many columns to use in kNN. We need to collapse the smaller countries into a single "other" category.

In the next code cell, you will run the code to do this.

#### Collapse the `country` column

The cell below contains the code to collapse the `country` column reassign any coutries with less occurances than the threshold to have the value 'other'.

> *Hint: Look at the variable assignment in the cell below to identify where the threshold for grouping countries is set to help answer **question 3**. Notice where this is used in the line below the does the actual collapsing.*

In [13]:
threshold = 100
model_data['country'] = model_data['country'].apply(lambda x: x if model_data['country'].value_counts()[x] > threshold else 'other')

In [14]:
# check the new counts after collapsing

model_data['country'].value_counts()


Unnamed: 0_level_0,count
country,Unnamed: 1_level_1
AU,4880
US,2716
other,709
JP,502
KR,358
FR,219
GB,172
ES,151
IT,123
HK,120


Now the `country` column has 11 levels, including our new "other" category. This is still a little large for kNN, but we can always come back and adjust this if we find our model needs some tinkering.

#### Look at other problematic columns

Now let's check out some of the other categorical columns that we'll use and see if we need to collapse or simplify any of them.

In [15]:
# Check values for 'genre' column

model_data['genre'].value_counts()

Unnamed: 0_level_0,count
genre,Unnamed: 1_level_1
Drama,556
Comedy,373
"Drama, Romance",268
Horror,258
"Horror, Thriller",202
...,...
"Adventure, Family, Mystery, Science Fiction",1
"Mystery, Drama, Action, Crime",1
"Thriller, Comedy, Action",1
"Drama, Romance, Horror",1


In [16]:
# Check values for 'orig_lang' column

model_data['orig_lang'].value_counts()

Unnamed: 0_level_0,count
orig_lang,Unnamed: 1_level_1
English,7381
Japanese,675
"Spanish, Castilian",388
Korean,384
French,282
Chinese,144
Italian,142
Cantonese,141
German,89
Russian,65


Both the `genre` and `orig_lang` columns often contain multiple values separated by commas within a single cell (for example, "Animation, Adventure, Family" or "Spanish, Castilian"). This means that instead of just one value, each cell can list several genres or languages for a movie.

This is too complex for machine learning models like kNN, which work best when each column contains just one clear value per row. To make things simpler and easier to model, we will only keep the first value from each cell. This way, each movie will have just one genre and one language listed, making our data cleaner and better suited for the modeling we are doing.


#### Get the first value for `genre` and `orig_lang`


This process happens in three steps:

1. **Define the function:**  
    To simplify columns that contain lists of values, we we use a **function** - a reusable block of code that performs a specific task. Think of a function like a mini-program: you give it some input, it does something for you, and gives you back a result. Here, our function will extract just the first value from each cell to simplify the data so each row has only one value per column.

2. **Apply the function:**  
    We then apply this function to the `genre` and `orig_lang` columns, creating new columns called `top_genre` and `top_lang` with just the top value for each movie.

3. **Recheck the counts:**  
    After this step, we check how many unique values remain in these simplified columns to confirm that our data is now easier to work with.

In [17]:
# define the function to simplify the columns

def get_top_value(old_column_name, new_column_name):
    """
    Function to extract the first value from a column that contains multiple comma seperated values
    Appends a new column to the dataframe
    """

    col = list(model_data[old_column_name].values)

    top_list = []
    for item in col:
        item = str(item).split(",")
        item1 = item[0]
        top_list.append(item1)

    model_data[new_column_name] = top_list

In [18]:
# apply the function to the 'genre' and 'orig_lang' columns

get_top_value('genre', 'top_genre')
get_top_value('orig_lang', 'top_lang')

In [19]:
# Check the counts again

display(model_data['top_genre'].value_counts(),
        model_data['top_lang'].value_counts())

Unnamed: 0_level_0,count
top_genre,Unnamed: 1_level_1
Drama,1865
Action,1563
Comedy,1377
Horror,931
Animation,885
Thriller,577
Adventure,571
Romance,413
Crime,371
Family,333


Unnamed: 0_level_0,count
top_lang,Unnamed: 1_level_1
English,7381
Japanese,675
Spanish,388
Korean,384
French,282
Chinese,144
Italian,142
Cantonese,141
German,89
Russian,65


The `top_genre` column now has 19 different values, which is easier for kNN to work with than the original genre data that had lots of combinations. By simplifying this column, the distance calculations in the model will have more meaning. If we notice that some genres are still very rare or if our model isn’t working well, we can always come back and group those rare genres into an "other" category to make things even simpler.


For the `top_lang` column, we’ve also made things easier by keeping only the main language for each movie. But some languages only show up a few times, which can overly-complicate the distance calculations in the model. To fix this, we should group these less common languages into an "other" category, just like we did for the `country` column. This helps the most important languages have meaning and avoids over-complicating the model with the ones that don’t appear often.

The cell below contains the code to collapse the `top_lang` column reassign any languages with less occurances than the threshold to have the value 'other'.

> *Hint: Look at the variable assignment in the cell below to identify where the threshold for grouping languages is set to help answer **question 3**. Notice where this is used in the line below the does the actual collapsing.*

In [20]:
#collapse top_lang

threshold = 10
model_data['top_lang'] = model_data['top_lang'].apply(lambda x: x if model_data['top_lang'].value_counts()[x] > threshold else 'other')

### Reformat columns

For kNN modeling, each column in your dataset should be in a format that the algorithm can easily understand and compare. This means:

- **Numeric columns** (like `budget_x`, `revenue`, and `date_x`) should contain numbers only, and be formatted as such so the model can measure distances between values. We also need to scale these columns so that the values are in the same range, which ensures all features contribute equally rather than letting large-range variables dominate.
- **Categorical columns** (like `country`, `top_genre`, and `top_lang`) should be converted into a format the model can use. One common way to do this is by creating dummy variables (also called one-hot encoding). For example, if you have a column called `top_genre` with values like "Action", "Comedy", or "Drama", you make a new column for each possible genre: one column for "Action", one for "Comedy", one for "Drama", and so on. In each new column, you put a `1` (or `True`) if the movie belongs to that genre, and a `0` (or `False`) if it does not. This allows for distances between values to be calculated.
- **The variable you are predicting** (in this case `score`) needs to be categorical, such as "high" or "low", so kNN can classify new data into one of these categories. While kNN *can* be used for regression (predicting specific values), it is best suited for classification - which is how we will be using it here.

By carefully reformatting each column, we ensure that kNN can accurately measure the "distance" between movies based on meaningful, comparable features—leading to better predictions and more reliable results.

#### Date

The original `date_x` column contains full date information (month, day, year), but for our analysis we will simplify the data and make it easier for kNN to compare movies based on their release year. This also helps avoid unnecessary complexity from day/month differences that aren't meaningful for our predictions.

This is another place where if we find our model does not perform well, we may want to revisit this decision to only keep the year, as the month a movie gets released may have some predictive power as well.

**Step 1: Convert the `date_x` column to datetime format**  
First, we use a pandas function to change the `date_x` column from a string (like "03/02/2023") into a pandas datetime object. This makes it easy to work with dates and extract parts like the year.

**Step 2: Extract the year from the datetime column**  
Once the column is in datetime format, we can use pandas to pull out just the year for each movie and save it in a new column called `year`. This gives us a simple numeric value that is much easier for kNN to use when comparing movies.

By following these steps, we transform a complex date into a single, meaningful feature that helps our model focus on the most relevant information.


In [21]:
## Convert to datetime format

model_data['date_x'] = pd.to_datetime(model_data['date_x'])

In [22]:
## extract the year and save as a new column 'year'
# this will save as an integer

model_data['year'] = model_data['date_x'].dt.year

#### Score

To use kNN for classification, we need to convert the movie scores from numbers to categories. In this case, we will use a 2-level classification: "high" or "low" scores.\
This will allow us to predict whether a movie will receive a "high" or "low" score based on the characteristics in our dataset.

In order to convert our score from numeric to categorical, we must determine a "threshold" to seperate the "high" and "low" scores.\
Use the density plot of the score distribution you created in LABS_06-Design to pick a threshold. Then:
 - Enter your chosen threshold into the code cell below before running it.
 - Provide an explanation for why you chose that threshold to answer **question 5**.

> ***Remember**: This is another situation where there is no "right answer" on what to choose.*\
*This is a decision point where you should make an informed choice and note that you can adjust it later if your model performance warrants it.*

In [23]:
## reformat score

score_threshold = 68
model_data['score'] = model_data['score'].apply(lambda x: 'high' if model_data['score'][x] > score_threshold else 'low')
model_data.head()

Unnamed: 0,names,date_x,score,genre,overview,crew,orig_title,status,orig_lang,budget_x,revenue,country,top_genre,top_lang,year
0,Creed III,2023-03-02,low,"Drama, Action","After dominating the boxing world, Adonis Cree...","Michael B. Jordan, Adonis Creed, Tessa Thompso...",Creed III,Released,English,75000000.0,271616700.0,AU,Drama,English,2023
1,Avatar: The Way of Water,2022-12-15,low,"Science Fiction, Adventure, Action",Set more than a decade after the events of the...,"Sam Worthington, Jake Sully, Zoe Saldaña, Neyt...",Avatar: The Way of Water,Released,English,460000000.0,2316795000.0,AU,Science Fiction,English,2022
2,The Super Mario Bros. Movie,2023-04-05,high,"Animation, Adventure, Family, Fantasy, Comedy","While working underground to fix a water main,...","Chris Pratt, Mario (voice), Anya Taylor-Joy, P...",The Super Mario Bros. Movie,Released,English,100000000.0,724459000.0,AU,Animation,English,2023
3,Mummies,2023-01-05,high,"Animation, Comedy, Family, Adventure, Fantasy","Through a series of unfortunate events, three ...","Óscar Barberán, Thut (voice), Ana Esther Albor...",Momias,Released,"Spanish, Castilian",12300000.0,34200000.0,AU,Animation,Spanish,2023
4,Supercell,2023-03-17,low,Action,Good-hearted teenager William always lived in ...,"Skeet Ulrich, Roy Cameron, Anne Heche, Dr Quin...",Supercell,Released,English,77000000.0,340942000.0,US,Action,English,2023


The score variable is now stored as strings of "high" or "low", but for kNN modeling, it must be converted and saved as a categorical variable because kNN groups data based on distinct labels. If the score is stored as a string, the algorithm may treat each unique text as a separate class, but converting it to a categorical type ensures consistent, clear groupings for accurate classification.

In [24]:
# reformat from string to category

model_data['score'] = model_data['score'].astype('category')

#### Scale Numeric Columns

The [MinMaxScaler()](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler) transforms all of our numeric columns to be on a scale of 0-1. This avoids any very large numbers (like `revenue` and `budget_x`) from overpowering the distance calculations and skewing the predictions.

In [25]:
# Select only numeric columns
numeric_cols = model_data.select_dtypes(include='number').columns

# Scale numeric columns
model_data[numeric_cols] = MinMaxScaler().fit_transform(model_data[numeric_cols])

In [26]:
## check the data types of each column
# note that score is now a 'category'

model_data.dtypes

Unnamed: 0,0
names,object
date_x,datetime64[ns]
score,category
genre,object
overview,object
crew,object
orig_title,object
status,object
orig_lang,object
budget_x,float64


### Drop extraneous columns

Some columns in our dataset are not useful for kNN modeling because they contain text, complex lists, or information that cannot be easily converted into numeric or categorical formats for distance calculations. For example, columns like `names`, `overview`, `crew`, and `orig_title` contain descriptive text or lists of people, which do not help the kNN algorithm calculate distances to compare movies in a meaningful way.

Additionally, for columns that were adjusted—such as `date_x`, `genre`, and `orig_lang`: we created new, simplified versions (`year`, `top_genre`, and `top_lang`) that are more suitable for kNN. After these adjustments, we drop the original columns to avoid redundancy and ensure our model only uses clean, relevant features.

By removing these extraneous columns, we streamline our dataset and focus on the features that will help kNN make accurate predictions.

In [27]:
## Drop the columns we won't use

model_data = model_data.drop(columns=['status', 'date_x', 'names', 'genre', 'overview', 'crew', 'orig_title', 'orig_lang'])
# Export cleaned data to CSV for later use
model_data.to_csv("imdb_movies_cleaned.csv", index=False)

#view the data now
model_data.head()

Unnamed: 0,score,budget_x,revenue,country,top_genre,top_lang,year
0,low,0.163043,0.092901,AU,Drama,English,1.0
1,low,1.0,0.792417,AU,Science Fiction,English,0.991667
2,high,0.217391,0.247788,AU,Animation,English,1.0
3,high,0.026739,0.011697,AU,Animation,Spanish,1.0
4,low,0.167391,0.116613,US,Action,English,1.0


### Seperate features from target

It is best practice to separate the columns used for prediction from the column you want to predict into distinct variables. This makes your code easier to read and helps you avoid mistakes, like accidentally using the answer to help make predictions. Most machine learning tools expect you to give them features and targets separately, so organizing your data this way makes your workflow smoother and less confusing.

In [28]:
# features: all columns except 'score'
features = model_data.drop('score', axis=1)

# Target: score column
target = model_data['score']

#### Dummy variables

Dummy variables are created by transforming each category in a column into its own separate column, where each row is marked with a 1 or 0 (coded as True or False) to indicate the presence or absence of that category. This process, called one-hot encoding, ensures that all categorical features are represented numerically, making them compatible with algorithms like kNN that rely on distance calculations.

For example, the `country` column contains values like "US", "AU", and "other", one-hot encoding will create new columns: `country_US`, `country_AU`, and `country_other`. Each row will have True in the column matching its country and False in the others. Similarly, for the `top_genre` column with values "Action", "Comedy", and "Drama", new columns `top_genre_Action`, `top_genre_Comedy`, and `top_genre_Drama` are created, with True/False indicating the genre for each movie.

By converting categorical data into dummy variables, we prepare our dataset for effective modeling and comparison.

In [29]:
## create dummy variables for the features dataframe
features = pd.get_dummies(features)

# preview the new features dataframe
features.head()

Unnamed: 0,budget_x,revenue,year,country_AU,country_ES,country_FR,country_GB,country_HK,country_IT,country_JP,...,top_lang_ Norwegian,top_lang_ Polish,top_lang_ Portuguese,top_lang_ Russian,top_lang_ Spanish,top_lang_ Swedish,top_lang_ Tagalog,top_lang_ Thai,top_lang_ Turkish,top_lang_other
0,0.163043,0.092901,1.0,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,1.0,0.792417,0.991667,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,0.217391,0.247788,1.0,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,0.026739,0.011697,1.0,True,False,False,False,False,False,False,...,False,False,False,False,True,False,False,False,False,False
4,0.167391,0.116613,1.0,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


### Train/Test split
Splitting the data into a **train set** and a **test set** is essential for building reliable machine learning models. The train set provides patterns and relationships for the algorithm to 'learn', while the test set is kept separate to evaluate performance on new, unseen data. This process helps identify overfitting - when an algorithm fits the training data too closely and fails to generalize - and ensures that results reflect true predictive power. Comparing accuracy on both sets allows for a confident assessment of how well the approach will work in real-world scenarios.

We use the [`train_test_split()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function from scikit-learn to perform this split. This function randomly divides the dataset into training and testing subsets, making it easy to control the size of each set and ensure reproducibility.

> Use the linked documentation and the code below to help answer **question 8**.

In [30]:
# train test split
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.2, random_state=45)


## Create the model

To build our kNN model, we must create our model, then train, test, and evaluate it. These steps are explained in more detail below.

**Step 1: Create model - Initialize the classifier object**  
We create a kNN classifier object using [`KNeighborsClassifier()`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html). This sets up the algorithm and specifies the nearest neighbors to classify each movie.

**Step 2: Train - Fit to the training data**  
We train the kNN model by calling [`.fit()`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier.fit). This step allows the classifier to learn patterns from the training data so it can make predictions.

**Step 3: Test - Predict on the testing data**  
We use the trained model to predict the score category ("high" or "low") for the movies in our test set with [`.predict()`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier.predict). This generates predictions for data the model has not seen before.

**Step 4: Evaluate - Calculate the accuracy of the testing data predictions**  
We evaluate how well our model performed by comparing the predicted categories to the actual categories in the test set using [`accuracy_score()`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html). This gives us a percentage that reflects the proportion of correct predictions.

In [148]:
# Create the model

knn = KNeighborsClassifier(n_neighbors=25)

In [149]:
# Train the model

knn.fit(features_train, target_train)

In [150]:
# Test the model

target_predicted = knn.predict(features_test)

In [151]:
# Evaluate the model

# Predict on the training data
target_train_predicted = knn.predict(features_train)

# Calculate and print training accuracy
print("Training Accuracy:", accuracy_score(target_train, target_train_predicted))

# Calculate and print testing accuracy
print("Testing Accuracy:", accuracy_score(target_test, target_predicted))

Training Accuracy: 0.5989304812834224
Testing Accuracy: 0.5181501740427648


## Your Turn: Adjust the model

You will now explore how the number of neighbors (`k`) affects your kNN model's accuracy. Follow these steps:

1. **Change the value of `n_neighbors`** in the cell where the kNN model is created. Try at least 5 different values.
1. **For each k value**, run *all* model building cells:  
    - Create the model  
    - Train the model  
    - Test the model  
    - Evaluate the accuracy  
1. **Record the accuracy** for each k value you try.
1. **Choose the best k** based on your results and explain your reasoning to answer **quesiton 10**.

> *Tip: The "best" k is usually the one that gives the highest accuracy on the test set, but consider if the accuracy is stable or if the model seems to be overfitting or underfitting at certain values.*