# Movie Recommendation System

## Project goal

- Build a movie recommendation system based on user ratings.
- Use SQL to analyze the data and Python to implement the recommendation model.

In [3]:
import pandas as pd
import numpy as np
import scipy.stats
from sqlalchemy import create_engine
import seaborn as sns
import matplotlib.pyplot as plt

## Loading the data

In [6]:
movies = pd.read_csv("../input/movies.csv")
links = pd.read_csv("../input/links.csv")
# low_memory=False turns off chunked, memory-saving parsing, so dtypes are inferred from the whole file
ratings = pd.read_csv("../input/ratings.csv", low_memory=False)
tags = pd.read_csv("../input/tags.csv")

### Data structure analysis

#### movies

In [10]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87585 entries, 0 to 87584
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  87585 non-null  int64 
 1   title    87585 non-null  object
 2   genres   87585 non-null  object
dtypes: int64(1), object(2)
memory usage: 2.0+ MB


##### movieId

`movieId` is a unique number for each film. It has no duplicates, so it will serve as the PRIMARY KEY.

In [27]:
movies['movieId'].duplicated().value_counts()

movieId
False    87585
Name: count, dtype: int64

##### title

`title` is the column holding the film title. I will use the TEXT data type for it.

In [30]:
movies['title']

0                          Toy Story (1995)
1                            Jumanji (1995)
2                   Grumpier Old Men (1995)
3                  Waiting to Exhale (1995)
4        Father of the Bride Part II (1995)
                        ...                
87580             The Monroy Affaire (2022)
87581            Shelter in Solitude (2023)
87582                           Orca (2023)
87583                The Angry Breed (1968)
87584             Race to the Summit (2023)
Name: title, Length: 87585, dtype: object

##### genres

`genres` is the column holding the film's genres, separated by `|`. I will use the TEXT data type for it.

In [32]:
movies['genres']

0        Adventure|Animation|Children|Comedy|Fantasy
1                         Adventure|Children|Fantasy
2                                     Comedy|Romance
3                               Comedy|Drama|Romance
4                                             Comedy
                            ...                     
87580                                          Drama
87581                                   Comedy|Drama
87582                                          Drama
87583                                          Drama
87584                   Action|Adventure|Documentary
Name: genres, Length: 87585, dtype: object
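The column types chosen above for `movies` (an integer PRIMARY KEY plus two TEXT columns) can be sketched as a table definition. This is a minimal illustration, not the notebook's actual schema: it uses an in-memory SQLite database from the standard library, the `NOT NULL` constraints are an assumption based on the non-null counts in `movies.info()`, and the `movieId` values in the sample rows are illustrative.

```python
import sqlite3

# In-memory database used only for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE movies (
        movieId INTEGER PRIMARY KEY,
        title   TEXT NOT NULL,
        genres  TEXT NOT NULL
    )
    """
)

# Titles and genres come from the previews above; the ids are illustrative.
sample_rows = [
    (1, "Toy Story (1995)", "Adventure|Animation|Children|Comedy|Fantasy"),
    (2, "Jumanji (1995)", "Adventure|Children|Fantasy"),
]
conn.executemany("INSERT INTO movies VALUES (?, ?, ?)", sample_rows)

# The PRIMARY KEY constraint rejects a second row with an existing movieId.
try:
    conn.execute("INSERT INTO movies VALUES (1, 'Duplicate', 'Drama')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM movies").fetchone()[0]
print(duplicate_rejected, row_count)
```

Declaring the key in the schema means the uniqueness checked with `duplicated()` is enforced by the database on every later insert.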

#### links

In [12]:
links.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87585 entries, 0 to 87584
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   movieId  87585 non-null  int64  
 1   imdbId   87585 non-null  int64  
 2   tmdbId   87461 non-null  float64
dtypes: float64(1), int64(2)
memory usage: 2.0 MB


In [68]:
links.sample()

       movieId  imdbId    tmdbId
21762   112095   66273  163683.0
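The `links.info()` output reports 87461 non-null `tmdbId` values out of 87585 rows, so 124 are missing, which is why pandas stores the column as `float64`. One possible way to keep the ids as integers is the nullable `Int64` dtype. The sketch below uses a toy frame: the first row is the sample shown in this notebook, the second row is made up to carry a missing value.

```python
import numpy as np
import pandas as pd

# Toy stand-in for links; the second row is hypothetical.
toy_links = pd.DataFrame(
    {
        "movieId": [112095, 999999],
        "imdbId": [66273, 1234567],
        "tmdbId": [163683.0, np.nan],
    }
)

# Count missing ids before converting.
missing = int(toy_links["tmdbId"].isna().sum())

# Nullable integer dtype: whole numbers stay integers, NaN becomes <NA>.
toy_links["tmdbId"] = toy_links["tmdbId"].astype("Int64")

print(missing)
print(toy_links.dtypes)
```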


##### movieId

In [40]:
links['movieId'].sample(10)

16528     87227
23587    118528
22993    116724
86221    287861
38100    153340
54021    187251
65040    211728
27237    128476
78040    260591
69158    222755
Name: movieId, dtype: int64

In [42]:
links['movieId'].duplicated().value_counts()

movieId
False    87585
Name: count, dtype: int64

In [46]:
links['imdbId']

0          114709
1          113497
2          113228
3          114885
4          113041
           ...   
87580    26812510
87581    14907358
87582    12388280
87583       64027
87584    28995566
Name: imdbId, Length: 87585, dtype: int64

#### ratings

In [14]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32000204 entries, 0 to 32000203
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   userId     int64  
 1   movieId    int64  
 2   rating     float64
 3   timestamp  int64  
dtypes: float64(1), int64(3)
memory usage: 976.6 MB
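The 976.6 MB reported above follows directly from the row count and the four 8-byte columns. The arithmetic below reproduces that figure and estimates the footprint of a narrower layout. The narrower dtypes are an assumption: they presuppose that `userId` and `movieId` fit in 32 bits, which this notebook does not check.

```python
# Row count and dtypes as reported by ratings.info().
N_ROWS = 32_000_204
BYTES = {"int64": 8, "float64": 8, "int32": 4, "float32": 4}

current = {"userId": "int64", "movieId": "int64", "rating": "float64", "timestamp": "int64"}
# Hypothetical narrower layout; assumes both id columns fit in 32 bits.
narrow = {"userId": "int32", "movieId": "int32", "rating": "float32", "timestamp": "int64"}


def size_mib(layout, n_rows=N_ROWS):
    """Return the size of the column data in MiB (index overhead ignored)."""
    return n_rows * sum(BYTES[dtype] for dtype in layout.values()) / 2**20


current_mib = size_mib(current)
narrow_mib = size_mib(narrow)
print(f"current: {current_mib:.1f} MiB, narrow: {narrow_mib:.1f} MiB")
```

A layout like `narrow` could be passed to `pd.read_csv` through its `dtype` argument if the value ranges are confirmed first.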


#### tags

In [19]:
tags.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000072 entries, 0 to 2000071
Data columns (total 4 columns):
 #   Column     Dtype 
---  ------     ----- 
 0   userId     int64 
 1   movieId    int64 
 2   tag        object
 3   timestamp  int64 
dtypes: int64(3), object(1)
memory usage: 61.0+ MB


In [70]:
movies[movies['movieId'] == 112095]

       movieId          title genres
21762   112095  R.P.M. (1970)  Drama


In [None]:


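def info_column(data):
    """Print the value counts of every column in `data`."""
    for column in data.columns:
        print(f"***{column}***")
        print(data[column].value_counts())
        print("__" * 10)
        print()


# The original call passed `movies_metadata`, which is never defined in this notebook;
# `movies` is used here instead as one of the loaded DataFrames.
info_column(movies)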

In [72]:
links[links['movieId'] == 112095]

       movieId  imdbId    tmdbId
21762   112095   66273  163683.0
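The two lookups for `movieId` 112095, one in `movies` and one in `links`, can also be expressed as a single join on the shared key. This is a minimal sketch on one-row frames rebuilt from the outputs in this notebook. The `validate="one_to_one"` argument is an added safeguard that makes pandas raise an error if `movieId` is duplicated on either side.

```python
import pandas as pd

# One-row frames rebuilt from the lookup outputs for movieId 112095.
movies_row = pd.DataFrame(
    {"movieId": [112095], "title": ["R.P.M. (1970)"], "genres": ["Drama"]}
)
links_row = pd.DataFrame(
    {"movieId": [112095], "imdbId": [66273], "tmdbId": [163683.0]}
)

# Left join keeps every movie even when no link row exists for it.
joined = movies_row.merge(links_row, on="movieId", how="left", validate="one_to_one")
print(joined)
```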
