## Steps for creating a user-based book recommendation system using k-Nearest Neighbors (k-NN)

1. Load and preprocess the data: Read the users.csv and ratings.csv files, and merge them based on the user ID.
2. Feature engineering: Convert categorical features (like gender and liked categories) into numerical features.
3. Train-test split: Split the data into training and testing sets.
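4. Model training: Fit a classifier on the training set. The code in Step 4 uses scikit-learn's `RandomForestClassifier` rather than a k-NN model.
5. Model evaluation: Measure the exact-match accuracy of the predicted ISBNs on the test set.
6. Make predictions: Use the trained model to suggest a book for a new user profile.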

## Step 1: Load and preprocess the data

In [14]:
import pandas as pd

# Load the data
users = pd.read_csv('dataset/users.csv')
ratings = pd.read_csv('dataset/ratings.csv')

# Merge the data on User-ID
data = pd.merge(ratings, users, left_on='User-ID', right_on='user id')

# Drop unnecessary columns
data.drop(columns=['user id', 'name'], inplace=True)

# Display the first few rows of the merged data
print(data.head())

   User-ID           ISBN  Ratings  age  gender         liked categories
0        1  9789551319182        5   26  Female  Novels,Sci-Fi,Adventure
1        1  9789550980239        5   26  Female  Novels,Sci-Fi,Adventure
2        1  9789558415252        4   26  Female  Novels,Sci-Fi,Adventure
3        1  9786245594252        4   26  Female  Novels,Sci-Fi,Adventure
4        1     9551468031        3   26  Female  Novels,Sci-Fi,Adventure
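
Because `pd.merge` performs an inner join by default, any rating whose `User-ID` has no match in the users table is dropped without warning. The following is a minimal sketch of how one might check for that, using small made-up frames in place of the real CSV files (all values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Tiny stand-ins for ratings.csv and users.csv (made-up values).
ratings = pd.DataFrame({
    'User-ID': [1, 1, 2, 3],
    'ISBN': ['A', 'B', 'A', 'C'],
    'Ratings': [5, 4, 3, 5],
})
users = pd.DataFrame({
    'user id': [1, 2],
    'age': [26, 31],
})

# A left join with indicator=True keeps every rating and records in the
# '_merge' column whether a matching user row was found.
# validate='many_to_one' raises if a user id appears twice in `users`.
checked = pd.merge(ratings, users, left_on='User-ID', right_on='user id',
                   how='left', indicator=True, validate='many_to_one')

# Ratings whose user is missing from the users table.
unmatched = checked[checked['_merge'] == 'left_only']
print(f'{len(unmatched)} rating(s) have no matching user')
```

If the count is zero on the real data, the inner join in the cell above loses nothing.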


## Step 2: Feature engineering

In [15]:
from sklearn.preprocessing import LabelEncoder

# Encode categorical features
label_encoder = LabelEncoder()
data['gender'] = label_encoder.fit_transform(data['gender'])

# One-hot encode liked categories
data = data.join(data['liked categories'].str.get_dummies(sep=','))

# Drop the original liked categories column
data.drop(columns=['liked categories'], inplace=True)

# Display the first few rows of the processed data
print(data.head())

   User-ID           ISBN  Ratings  age  gender  Adventure  Biographie  Child  \
0        1  9789551319182        5   26       0          1           0      0   
1        1  9789550980239        5   26       0          1           0      0   
2        1  9789558415252        4   26       0          1           0      0   
3        1  9786245594252        4   26       0          1           0      0   
4        1     9551468031        3   26       0          1           0      0   

   Crime  Educational  Fiction  Historical  Mystrey  Novels  Sci-Fi  \
0      0            0        0           0        0       1       1   
1      0            0        0           0        0       1       1   
2      0            0        0           0        0       1       1   
3      0            0        0           0        0       1       1   
4      0            0        0           0        0       1       1   

   Short stories  Spy  Translations  short stories  
0              0    0             0              0  
1              0    0             0              0  
2              0    0             0              0  
3              0    0             0              0  
4              0    0             0              0  
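
The encoded output contains both a `Short stories` and a `short stories` column, which suggests that some raw labels differ only in letter case. The sketch below shows one possible way to normalize labels before calling `get_dummies`. The input strings are made up, and which inconsistencies the real file contains (case, stray spaces) is an assumption:

```python
import pandas as pd

# Made-up examples of the 'liked categories' column.
liked = pd.Series([
    'Novels,Sci-Fi,Adventure',
    'Short stories, Spy',
    'short stories,Novels',
])

# Split each string into labels, trim whitespace, and lower-case them so
# that 'Short stories' and 'short stories' collapse into a single label.
normalized = liked.apply(
    lambda s: ','.join(part.strip().lower() for part in s.split(','))
)

dummies = normalized.str.get_dummies(sep=',')
print(sorted(dummies.columns))
```

Lower-casing is only one option; mapping labels through an explicit dictionary would also allow misspelled labels to be corrected at the same time.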

## Step 3: Train-test split

In [16]:
from sklearn.model_selection import train_test_split

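# Define features and target.
# Note: 'User-ID' is still part of X here (hence the 17 feature columns),
# so the model treats it as a numeric feature. The target is the book's ISBN.
X = data.drop(columns=['ISBN', 'Ratings'])
y = data['ISBN']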

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Display the shapes of the training and testing sets
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(246, 17) (83, 17) (246,) (83,)


## Step 4: Model training

In [17]:
from sklearn.ensemble import RandomForestClassifier

# Initialize the model
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
model.fit(X_train, y_train)
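
The title refers to k-NN, while the cell above fits a random forest. For comparison, here is a minimal sketch of a user-based k-NN recommender built on scikit-learn's `NearestNeighbors`: it finds the users most similar to a new profile and ranks the books they rated by mean rating. The toy frames, the helper name `recommend`, and the choice of cosine distance are assumptions for illustration, not the notebook's method.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# One row per user: made-up profile features (age scaled to [0, 1]).
user_features = pd.DataFrame(
    {
        'age': [0.26, 0.24, 0.60, 0.55],
        'Novels': [1, 1, 0, 0],
        'Sci-Fi': [1, 1, 0, 0],
        'Crime': [0, 0, 1, 1],
    },
    index=pd.Index([1, 2, 3, 4], name='User-ID'),
)

# Made-up ratings in the same layout as ratings.csv.
ratings = pd.DataFrame({
    'User-ID': [1, 1, 2, 3, 4],
    'ISBN': ['A', 'B', 'B', 'C', 'C'],
    'Ratings': [5, 4, 5, 5, 4],
})

# Index the user profiles for nearest-neighbour lookup.
knn = NearestNeighbors(n_neighbors=2, metric='cosine')
knn.fit(user_features.to_numpy())

def recommend(profile, top_n=3):
    """Rank books rated by the users nearest to `profile` (hypothetical helper)."""
    _, idx = knn.kneighbors([profile])
    neighbour_ids = user_features.index[idx[0]]
    neighbour_ratings = ratings[ratings['User-ID'].isin(neighbour_ids)]
    return (neighbour_ratings.groupby('ISBN')['Ratings']
            .mean()
            .sort_values(ascending=False)
            .head(top_n))

print(recommend([0.25, 1, 1, 0]))
```

Unlike the classifier, this approach returns a ranked list of several books, which tends to suit a recommendation task better than a single predicted ISBN.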

## Step 5: Model evaluation

In [18]:
from sklearn.metrics import accuracy_score

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

Accuracy: 0.00
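
An accuracy near zero is not surprising with this setup: every row belonging to a given user has identical features but a different ISBN as its target, so the model cannot single out the one held-out ISBN. A softer measure is whether the held-out ISBN appears among the k classes the model considers most probable. The sketch below illustrates that idea on synthetic data; the helper `hit_rate_at_k` and all values are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def hit_rate_at_k(model, X, y_true, k=3):
    """Fraction of rows whose true label is among the k most probable classes."""
    proba = model.predict_proba(X)
    # Indices of the k highest-probability classes for each row.
    top_k = np.argsort(proba, axis=1)[:, -k:]
    top_k_labels = model.classes_[top_k]
    hits = [label in row for label, row in zip(y_true, top_k_labels)]
    return float(np.mean(hits))

# Synthetic stand-in data: two user profiles, each with three rated books.
X_train = np.array([[0, 1]] * 3 + [[1, 0]] * 3)
y_train = np.array(['A', 'B', 'C', 'D', 'E', 'F'])

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Held-out rows: the same two profiles, one known book each.
X_test = np.array([[0, 1], [1, 0]])
y_test = np.array(['B', 'E'])

print(f'Hit rate@3: {hit_rate_at_k(model, X_test, y_test, k=3):.2f}')
```

On the real data, `k` would need to be chosen relative to the number of distinct ISBNs, which the notebook does not report.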


## Step 6: Make predictions

In [19]:
# Example input: age, gender, liked categories
input_data = {
    'age': 24,
    'gender': 'Female',
    'Novels': 1,
    'Sci-Fi': 1,
    'Adventure': 1,
}

# Convert input data to DataFrame
input_df = pd.DataFrame([input_data])

# Encode the gender with the encoder fitted in Step 2
input_df['gender'] = label_encoder.transform(input_df['gender'])

# Align the columns with those seen during fit. Any column not supplied above
# (the remaining categories, and 'User-ID' for a new user) is filled with 0.
# Without this step, predict() raises:
# "ValueError: The feature names should match those that were passed during fit."
# because columns such as Biographie, Child, Crime, Educational and Fiction are missing.
input_df = input_df.reindex(columns=X_train.columns, fill_value=0)

# Make a prediction
predicted_isbn = model.predict(input_df)
print(f'Suggested Book ISBN: {predicted_isbn[0]}')
