# **DS Bootcamp. Team00**

## Team Ечпочмак:
   - *anibalpe*
   - *rammussc*


# Introduction
#### This report analyzes the MovieLens dataset using the custom `movielens_analysis` module. Each section below demonstrates a different aspect of the data: movie ratings, tags, movie metadata, and financial data from IMDb.


## All the data you need is in `ml-latest-small`:

1. **ratings.csv:** userId, movieId, rating, timestamp

2. **movies.csv:** movieId, title, genres

3. **tags.csv:** userId, movieId, tag, timestamp

4. **links.csv:** movieId, imdbId, tmdbId

## How to start the project
----------------------------------------------------------------

1. Clone the repository: `git clone {Paste here text}`


2. Create a virtual environment

```bash
python3 -m venv .env
```

3. Activate the virtual environment

```bash
source .env/bin/activate
```


4. Install the necessary packages:

```bash 
pip install -r requirements.txt
```


## Project Structure

1. `movielens_analysis`: Main module with classes and functions for data analysis and tests.

2. `movielens_report`: Main report module with all the analysis results.

3. `requirements.txt`: List of required packages for the project.


## Troubleshooting:
   - FileNotFoundError: Ensure the `ml-latest-small/` files are in the specified path. Check file paths in the code.

   - ModuleNotFoundError: Verify that `requests`, `beautifulsoup4`, and `pytest` are installed in the active virtual environment.
   
   - IMDb Request Errors: Check your internet connection or increase the delay in `Links._fetch_page`.
   
   - Test Failures: Ensure the dataset files match the expected format. Download a fresh copy of `ml-latest-small` if needed.
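
The `FileNotFoundError` case above can be checked before running any analysis. The following is a minimal sketch, assuming the four CSV files listed earlier live directly under `ml-latest-small/`; the helper name `missing_dataset_files` is hypothetical and not part of `movielens_analysis`.

```python
from pathlib import Path

# File names taken from the dataset list in this report.
EXPECTED_FILES = ("ratings.csv", "movies.csv", "tags.csv", "links.csv")


def missing_dataset_files(data_dir="ml-latest-small"):
    """Return the expected CSV files that are absent from data_dir (hypothetical helper)."""
    base = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (base / name).is_file()]


missing = missing_dataset_files()
if missing:
    print(f"Missing files in ml-latest-small/: {missing}")
else:
    print("All dataset files found.")
```

Running this once at the top of the notebook gives a clearer message than a traceback raised from inside a class constructor.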

In [11]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


Setting up the program

In [12]:
from movielens_analysis import Ratings, Tags, Movies, Links

ratings = Ratings('ml-latest-small/ratings.csv')
tags = Tags('ml-latest-small/tags.csv')
movies = Movies('ml-latest-small/movies.csv')
links = Links('ml-latest-small/links.csv', limit=100)

## Ratings Analysis

### Distribution of ratings by year:

In [13]:
rate_movies = ratings.movies
result = rate_movies.dist_by_year()
print("Top 5 years with the most ratings:")
print(dict(sorted(result.items(), key=lambda x: x[1], reverse=True)[:5]))
%timeit dict(sorted(result.items(), key=lambda x: x[1], reverse=True)[:5])

Top 5 years with the most ratings:
{2000: 10061, 2017: 8199, 2007: 7111, 2016: 6702, 2015: 6616}
3.15 μs ± 208 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
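
For readers who want to see how per-year counts can be derived from the `timestamp` column of `ratings.csv`, here is a minimal sketch. It assumes the timestamps are Unix seconds and uses UTC; the function name `ratings_per_year` is hypothetical, and this is not necessarily how `dist_by_year` is implemented.

```python
from collections import Counter
from datetime import datetime, timezone


def ratings_per_year(timestamps):
    """Count ratings per calendar year from Unix timestamps (hypothetical helper)."""
    years = (
        datetime.fromtimestamp(int(ts), tz=timezone.utc).year
        for ts in timestamps
    )
    return dict(Counter(years))


# Three illustrative timestamps: two from 2000 and one from 2017.
sample = [964982703, 964981247, 1493846352]
print(ratings_per_year(sample))
```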


### Distribution of ratings by value

In [14]:
rate_movies = ratings.movies
rate_dist = rate_movies.dist_by_rating()
print("Counts for the 5 most common rating values:")
print(dict(sorted(rate_dist.items(), key=lambda x: x[1], reverse=True)[:5]))
%timeit dict(sorted(rate_dist.items(), key=lambda x: x[1], reverse=True)[:5])

Counts for the 5 most common rating values:
{4.0: 26818, 3.0: 20047, 5.0: 13211, 3.5: 13136, 4.5: 8551}
1.8 μs ± 15.7 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)


### Distribution of movies by number of ratings

In [15]:
rate_movies = ratings.movies
print("Top 5 most-rated movies:")
print(rate_movies.top_by_num_of_ratings(5))
%timeit rate_movies.top_by_num_of_ratings(5)

Top 5 most-rated movies:
{'Forrest Gump (1994)': 329, '"Shawshank Redemption, The (1994)"': 317, 'Pulp Fiction (1994)': 307, '"Silence of the Lambs, The (1991)"': 279, '"Matrix, The (1999)"': 278}
4.65 ms ± 301 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


### Top movies by their average rating

In [16]:
rate_movies = ratings.movies
print("Top 5 movies by average rating:")
print(rate_movies.top_by_ratings(5))
%timeit rate_movies.top_by_ratings(5)

Top 5 movies by average rating:
{'The Jinx: The Life and Deaths of Robert Durst (2015)': 5.0, 'Galaxy of Terror (Quest) (1981)': 5.0, 'Alien Contamination (1980)': 5.0, "I'm the One That I Want (2000)": 5.0, 'Lesson Faust (1994)': 5.0}
32.1 ms ± 2.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
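
Ranking by average rating can be sketched as follows. This is an illustrative version that works on `(title, rating)` pairs, not the module's own code; the `min_ratings` parameter is a hypothetical addition that shows how movies with very few ratings could be filtered out of the ranking.

```python
from collections import defaultdict


def top_by_average(rows, n, min_ratings=1):
    """Return the n titles with the highest mean rating (hypothetical helper).

    rows is an iterable of (title, rating) pairs.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for title, rating in rows:
        sums[title] += rating
        counts[title] += 1
    averages = {
        title: round(sums[title] / counts[title], 2)
        for title in sums
        if counts[title] >= min_ratings
    }
    ranked = sorted(averages.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:n])


rows = [("A", 5.0), ("B", 4.0), ("B", 5.0), ("C", 3.0)]
print(top_by_average(rows, 2))
print(top_by_average(rows, 2, min_ratings=2))
```

With `min_ratings=1`, a title rated once with 5.0 outranks everything else, which is one possible reason a list like the one above is dominated by perfect scores.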


### Top movies by their rating variance

In [17]:
rate_movies = ratings.movies
print("Top 5 most controversial movies by rating variance:")
print(rate_movies.top_controversial(5))
%timeit rate_movies.top_controversial(5)

Top 5 most controversial movies by rating variance:
{"Ivan's Childhood (a.k.a. My Name is Ivan) (Ivanovo detstvo) (1962)": 10.12, 'Fanny and Alexander (Fanny och Alexander) (1982)': 10.12, 'Lassie (1994)': 8.0, '"Zed & Two Noughts, A (1985)"': 8.0, 'Kwaidan (Kaidan) (1964)': 8.0}
48.7 ms ± 3.48 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
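
The scores above (10.12 and 8.0) are larger than any population variance that ratings on a 0.5 to 5.0 scale can produce, so they are consistent with a sample variance (denominator n - 1) computed over very few ratings. This is an inference from the printed output, not a statement about how `top_controversial` is implemented. A short check with the standard library:

```python
from statistics import pvariance, variance

# Two ratings at opposite ends of the scale.
extreme = [0.5, 5.0]

sample_var = variance(extreme)       # divides by n - 1
population_var = pvariance(extreme)  # divides by n

print(sample_var, population_var)
print(round(sample_var, 2))

# A movie rated once with 1.0 and once with 5.0.
print(variance([1.0, 5.0]))
```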


### Distribution of users by number of ratings

In [18]:
users = ratings.users
result = users.dist_by_num_of_ratings()
print(dict(sorted(result.items(), key=lambda x: x[1], reverse=True)[:10]))
%timeit dict(sorted(result.items(), key=lambda x: x[1], reverse=True)[:10])

{21: 15, 20: 14, 22: 14, 56: 14, 23: 13, 26: 13, 35: 11, 33: 10, 25: 9, 34: 9}
25.1 μs ± 1.5 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
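
In the dictionary above, each key is a number of ratings and each value is how many users gave exactly that many. A minimal sketch of that two-step count, assuming a flat list of `userId` values as input (the function name is hypothetical):

```python
from collections import Counter


def users_by_num_ratings(user_ids):
    """Map 'number of ratings' to 'number of users with that many ratings'."""
    ratings_per_user = Counter(user_ids)            # userId -> number of ratings
    return dict(Counter(ratings_per_user.values())) # count -> number of users


# Users 1 and 2 rated twice, user 3 rated once.
print(users_by_num_ratings([1, 1, 2, 2, 3]))
```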


### Distribution of users by average rating

In [19]:
users = ratings.users
result = users.dist_by_metric()
print(dict(sorted(result.items(), key=lambda x: x[1], reverse=True)[:10]))
%timeit dict(sorted(result.items(), key=lambda x: x[1], reverse=True)[:10])

{3.54: 10, 4.0: 10, 3.65: 9, 3.91: 9, 3.57: 8, 3.77: 8, 3.78: 8, 4.12: 8, 3.36: 7, 3.39: 7}
20.8 μs ± 1.22 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


### Most controversial users

In [20]:
users = ratings.users
result = users.top_controversial(10)
print(dict(sorted(result.items(), key=lambda x: x[1], reverse=True)[:10]))
%timeit dict(sorted(result.items(), key=lambda x: x[1], reverse=True)[:10])

{'3': 4.37, '55': 3.22, '461': 3.22, '259': 3.05, '329': 3.05, '175': 2.87, '502': 2.84, '598': 2.84, '393': 2.63, '138': 2.56}
1.84 μs ± 21.1 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)


## Tags Analysis

### Tags with the most words

In [21]:
print(f"Top 5 tags with the most words:\n{tags.most_words(5)}")
%timeit tags.most_words(5)

Top 5 tags with the most words:
{'Something for everyone in this one... saw it without and plan on seeing it with kids!': 16, 'the catholic church is the most corrupt organization in history': 10, 'villain nonexistent or not needed for good story': 8, '06 Oscar Nominated Best Movie - Animation': 7, 'It was melodramatic and kind of dumb': 7}
1.06 ms ± 28 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


### Longest tags by number of characters

In [22]:
print(f"Top 5 longest tags:\n{tags.longest(5)}")
%timeit tags.longest(5)

Top 5 longest tags:
['Something for everyone in this one... saw it without and plan on seeing it with kids!', 'the catholic church is the most corrupt organization in history', 'villain nonexistent or not needed for good story', 'r:disturbing violent content including rape', '06 Oscar Nominated Best Movie - Animation']
816 μs ± 28.1 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


### The most popular tags in movies

In [23]:
print(f"The 5 most popular tags:\n{tags.most_popular(5)}")
%timeit tags.most_popular(5)

The 5 most popular tags:
{'In Netflix queue': 131, 'atmospheric': 36, 'superhero': 24, 'thought-provoking': 24, 'Disney': 23}
1.16 ms ± 23.4 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


### Tags containing a given word

In [24]:
print(f"Tags containing 'Black':\n{tags.tags_with('Black')}")
%timeit tags.tags_with("Black")

Tags containing 'Black':
['Black comedy', 'black and white', 'black comedy', 'black hole', 'black humor', 'black humour', 'black-and-white']
110 μs ± 1.15 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
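
The result above contains both `'Black comedy'` and `'black comedy'`, which suggests a case-insensitive match over unique tags, returned in sorted order. A minimal sketch under that assumption (a standalone function, not the `Tags.tags_with` method itself):

```python
def tags_with(tags, word):
    """Return unique tags containing word, ignoring case, in sorted order."""
    needle = word.lower()
    return sorted({tag for tag in tags if needle in tag.lower()})


sample_tags = ["black hole", "Black comedy", "funny", "black hole"]
print(tags_with(sample_tags, "Black"))
```

Python's default string sort places uppercase letters before lowercase ones, which matches the ordering in the output above.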


## Movies Analysis

### Distribution by the year of release

In [25]:
result = movies.dist_by_release()
print("Top 5 years with the most released movies:")
print(dict(sorted(result.items(), key=lambda x: x[1], reverse=True)[:5]))
%timeit dict(sorted(result.items(), key=lambda x: x[1], reverse=True)[:5])

Top 5 years with the most released movies:
{'2015': 267, '2002': 257, '2014': 247, '2001': 235, '2006': 227}
10.4 μs ± 424 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


### Distribution by genre

In [26]:
result = movies.dist_by_genres()
print("Top 5 genres and their counts:")
print(dict(sorted(result.items(), key=lambda x: x[1], reverse=True)[:5]))
%timeit dict(sorted(result.items(), key=lambda x: x[1], reverse=True)[:5])

Top 5 genres and their counts:
{'Drama': 4361, 'Comedy': 3756, 'Thriller': 1894, 'Action': 1828, 'Romance': 1596}
2.9 μs ± 136 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
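
In `movies.csv` the `genres` column holds pipe-separated values, so one movie contributes to several genre counts. A minimal sketch of the counting step, assuming a list of raw `genres` strings as input (the function name is hypothetical):

```python
from collections import Counter


def genre_counts(genre_fields):
    """Count how many movies list each genre, given pipe-separated genre strings."""
    counts = Counter()
    for field in genre_fields:
        counts.update(field.split("|"))
    return dict(counts)


print(genre_counts(["Comedy|Romance", "Drama", "Comedy|Drama"]))
```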


### Movies with the most genres

In [27]:
print(f"Top 10 movies by number of genres:\n{movies.most_genres(10)}")
%timeit movies.most_genres(10)

Top 10 movies by number of genres:
OrderedDict({'Rubber (2010)': 10, 'Patlabor: The Movie (Kidô keisatsu patorebâ: The Movie) (1989)': 8, 'Aelita: The Queen of Mars (Aelita) (1924)': 7, 'Aqua Teen Hunger Force Colon Movie Film for Theaters (2007)': 7, 'Enchanted (2007)': 7, 'Inception (2010)': 7, 'Interstate 60 (2002)': 7, 'Mars Needs Moms (2011)': 7, 'Mulan (1998)': 7, 'Osmosis Jones (2001)': 7})
9.94 ms ± 499 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


## Metadata and Financial analysis

### Collecting the IMDb data

In [28]:
fields = ["Director", "Budget", "Cumulative Worldwide Gross", "Runtime"]
links.collect_all_imdb_data(fields)
%timeit links.collect_all_imdb_data(fields)

[INFO] Collecting data for up to 100 movies...

### Metadata for 3 movies

In [29]:
print(links.get_imdb([1, 2, 3], fields))
%timeit links.get_imdb([1, 2, 3], fields)

[['Grumpier Old Men (1995)', 'Howard Deutch', '$25,000,000 (estimated)', '$71,518,503', 'Runtime1 hour 41 minutes'], ['Jumanji (1995)', 'Joe Johnston', '$65,000,000 (estimated)', '$262,821,940', 'Runtime1 hour 44 minutes'], ['Toy Story (1995)', 'John Lasseter', '$30,000,000 (estimated)', '$394,436,586', 'Runtime1 hour 21 minutes']]
3.14 μs ± 188 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


### Most frequent directors

In [30]:
print(links.top_directors(2))
%timeit links.top_directors(2)

{'Martin Scorsese': 2, 'John Lasseter': 1}
42.4 μs ± 2.31 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


### Most expensive movies by budget

In [31]:
print(links.most_expensive(5))
%timeit links.most_expensive(5)

{'Cutthroat Island (1995)': 98000000, 'Braveheart (1995)': 72000000, 'Money Train (1995)': 68000000, 'Jumanji (1995)': 65000000, '"American President': 62000000}
448 μs ± 22.3 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
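
The key `'"American President'` above looks like a title that was cut at an embedded comma, which can happen when a CSV line is split on `,` by hand. The standard `csv` module handles quoted fields. The sketch below uses an illustrative two-row sample written for this example, not data read from the project files.

```python
import csv
import io

# Illustrative sample in the movies.csv layout: movieId, title, genres.
sample = (
    "movieId,title,genres\n"
    '11,"American President, The (1995)",Comedy|Drama|Romance\n'
    "1,Toy Story (1995),Adventure|Animation\n"
)

rows = list(csv.DictReader(io.StringIO(sample)))
for row in rows:
    print(row["movieId"], row["title"])

# Naive splitting breaks the quoted title into two pieces.
naive = sample.splitlines()[1].split(",")
print(naive[1])
```

The same approach also removes the surrounding double quotes seen in titles such as `'"Matrix, The (1999)"'` in the ratings section.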


### Most profitable movies (gross minus budget)

In [32]:
print(links.most_profitable(5))
%timeit links.most_profitable(5)

{'Toy Story (1995)': 364436586, 'Seven (a.k.a. Se7en) (1995)': 295983304, 'GoldenEye (1995)': 292194034, 'Pocahontas (1995)': 291079773, 'Babe (1995)': 224134910}
591 μs ± 19.1 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
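
The profit figures above match worldwide gross minus budget: for Toy Story, $394,436,586 - $30,000,000 = $364,436,586. A minimal sketch of that calculation from the scraped strings shown earlier; the helper names `parse_money` and `profit` are hypothetical, and the parser simply keeps the digits, so it assumes amounts in a single currency.

```python
import re


def parse_money(text):
    """Extract an integer amount from a string such as '$30,000,000 (estimated)'."""
    digits = re.sub(r"\D", "", text)
    return int(digits) if digits else None


def profit(budget_text, gross_text):
    """Return gross minus budget, or None if either amount is missing."""
    budget = parse_money(budget_text)
    gross = parse_money(gross_text)
    if budget is None or gross is None:
        return None
    return gross - budget


print(profit("$30,000,000 (estimated)", "$394,436,586"))
```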


### Longest movies by runtime

In [33]:
print(links.longest(5))
%timeit links.longest(5)

{'Nixon (1995)': 3, 'Waiting to Exhale (1995)': 2, 'Heat (1995)': 2, 'Sabrina (1995)': 2, 'GoldenEye (1995)': 2}
140 μs ± 1.11 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
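
The values above are small whole numbers (3 and 2), which look like hour counts rather than total minutes. If minute-level precision is wanted, the scraped strings shown earlier (for example `'Runtime1 hour 41 minutes'`) can be converted as in this sketch. The function name `runtime_minutes` is hypothetical, and the string format is assumed from the output in this report.

```python
import re


def runtime_minutes(text):
    """Convert a string like 'Runtime1 hour 41 minutes' to total minutes."""
    hours = re.search(r"(\d+)\s*hour", text)
    minutes = re.search(r"(\d+)\s*minute", text)
    if hours is None and minutes is None:
        return None
    total = int(hours.group(1)) * 60 if hours else 0
    total += int(minutes.group(1)) if minutes else 0
    return total


print(runtime_minutes("Runtime1 hour 41 minutes"))
print(runtime_minutes("Runtime2 hours"))
```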


### Budget per minute of runtime

In [34]:
print(links.top_cost_per_minute(5))
%timeit links.top_cost_per_minute(5)

{'Money Train (1995)': 68000000.0, 'Jumanji (1995)': 65000000.0, '"American President': 62000000.0, 'Pocahontas (1995)': 55000000.0, 'Fair Game (1995)': 50000000.0}
598 μs ± 20.3 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
