# Practicum Yandex Test (Data Scientist)

### Task 1. Working with data
To complete this task, use the data set in the attached file. Give the answer for each of the following steps and state how long it took you to complete the entire task.

**1.1.** Download the data set `movie_metadata.csv`, which contains data about films from IMDb (Internet Movie Database).

In [8]:
# Load Libs
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px

In [9]:
%matplotlib inline

In [49]:
# Reading CSV file
file = 'movie_metadata.csv'

df = pd.read_csv(file, sep=',', header=0, encoding="utf-8", 
                 low_memory=False, skipinitialspace=True, skip_blank_lines=True)

In [14]:
# Verify dataset length
df.shape

(5043, 28)

In [15]:
# Verify dataset data columns
df.head(5)

| | color | director_name | num_critic_for_reviews | duration | director_facebook_likes | actor_3_facebook_likes | actor_2_name | actor_1_facebook_likes | gross | genres | ... | num_user_for_reviews | language | country | content_rating | budget | title_year | actor_2_facebook_likes | imdb_score | aspect_ratio | movie_facebook_likes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Color | James Cameron | 723.0 | 178.0 | 0.0 | 855.0 | Joel David Moore | 1000.0 | 760505847.0 | Action\|Adventure\|Fantasy\|Sci-Fi | ... | 3054.0 | English | USA | PG-13 | 237000000.0$ | 2009.0 | 936.0 | 7.9 | 1.78 | 33000 |
| 1 | Color | Gore Verbinski | 302.0 | 169.0 | 563.0 | 1000.0 | Orlando Bloom | 40000.0 | 309404152.0 | Action\|Adventure\|Fantasy | ... | 1238.0 | English | USA | PG-13 | 300000000.0$ | 2007.0 | 5000.0 | 7.1 | 2.35 | 0 |
| 2 | Color | Sam Mendes | 602.0 | 148.0 | 0.0 | 161.0 | Rory Kinnear | 11000.0 | 200074175.0 | Action\|Adventure\|Thriller | ... | 994.0 | English | UK | PG-13 | 245000000.0$ | 2015.0 | 393.0 | 6.8 | 2.35 | 85000 |
| 3 | Color | Christopher Nolan | 813.0 | 164.0 | 22000.0 | 23000.0 | Christian Bale | 27000.0 | 448130642.0 | Action\|Thriller | ... | 2701.0 | English | USA | PG-13 | 250000000.0$ | 2012.0 | 23000.0 | 8.5 | 2.35 | 164000 |
| 4 | NaN | Doug Walker | NaN | NaN | 131.0 | NaN | Rob Walker | 131.0 | NaN | Documentary | ... | NaN | NaN | NaN | NaN | 0.0$ | NaN | 12.0 | 7.1 | NaN | 0 |


In [16]:
# Analyse dataset distribution
df.describe()

| | num_critic_for_reviews | duration | director_facebook_likes | actor_3_facebook_likes | actor_1_facebook_likes | gross | num_voted_users | cast_total_facebook_likes | facenumber_in_poster | num_user_for_reviews | title_year | actor_2_facebook_likes | imdb_score | aspect_ratio | movie_facebook_likes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4993.0 | 5028.0 | 4939.0 | 5020.0 | 5036.0 | 4159.0 | 5043.0 | 5043.0 | 5030.0 | 5022.0 | 4935.0 | 5030.0 | 5043.0 | 4714.0 | 5043.0 |
| mean | 140.194272 | 107.201074 | 686.509212 | 645.009761 | 6560.047061 | 48468410.0 | 83668.16 | 9699.063851 | 1.371173 | 272.770808 | 2002.470517 | 1651.754473 | 6.442138 | 2.220403 | 7525.964505 |
| std | 121.601675 | 25.197441 | 2813.328607 | 1665.041728 | 15020.75912 | 68452990.0 | 138485.3 | 18163.799124 | 2.013576 | 377.982886 | 12.474599 | 4042.438863 | 1.125116 | 1.385113 | 19320.44511 |
| min | 1.0 | 7.0 | 0.0 | 0.0 | 0.0 | 162.0 | 5.0 | 0.0 | 0.0 | 1.0 | 1916.0 | 0.0 | 1.6 | 1.18 | 0.0 |
| 25% | 50.0 | 93.0 | 7.0 | 133.0 | 614.0 | 5340988.0 | 8593.5 | 1411.0 | 0.0 | 65.0 | 1999.0 | 281.0 | 5.8 | 1.85 | 0.0 |
| 50% | 110.0 | 103.0 | 49.0 | 371.5 | 988.0 | 25517500.0 | 34359.0 | 3090.0 | 1.0 | 156.0 | 2005.0 | 595.0 | 6.6 | 2.35 | 166.0 |
| 75% | 195.0 | 118.0 | 194.5 | 636.0 | 11000.0 | 62309440.0 | 96309.0 | 13756.5 | 2.0 | 326.0 | 2011.0 | 918.0 | 7.2 | 2.35 | 3000.0 |
| max | 813.0 | 511.0 | 23000.0 | 23000.0 | 640000.0 | 760505800.0 | 1689764.0 | 656730.0 | 43.0 | 5060.0 | 2016.0 | 137000.0 | 9.5 | 16.0 | 349000.0 |


**1.2.** The `duration` column contains data on the film length. How many missing values are there in this column?

In [50]:
total_duration_isnan = df['duration'].isnull().sum()
percent_duration_isnan = (df['duration'].isna().mean() * 100)

print("Total duration missing values: {0} ({1} %)".format(total_duration_isnan, str(round(percent_duration_isnan, 2))))

Total duration missing values: 15 (0.3 %)


**1.3.** Replace the missing values in the `duration` column with the median value for this column.

In [51]:
median_duration = df['duration'].median()
# Assign the result back: inplace=True on a selected column is unreliable in recent pandas versions
df['duration'] = df['duration'].fillna(median_duration)

print("Median duration: ", median_duration)

Median duration:  103.0


In [52]:
# Validate operation
df['duration'].isnull().sum()

0

**1.4.** What is the average film length? Give the answer as a floating-point figure rounded to two decimal places.

In [53]:
average_duration = df['duration'].mean()

print("Average duration: ", str(round(average_duration, 2)))

Average duration:  107.19


**1.5.** How many films between 90 minutes and two hours long were released in 2008?

In [54]:
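# Strict bounds are one reading of "between"; switch to >= and <= for inclusive bounds
filter_more90min = df['duration'] > 90
filter_less120min = df['duration'] < 120
filter_2008 = df['title_year'] == 2008
# Sum the boolean mask to count matching rows. df.where(mask).count().sum() instead adds up
# the non-null cells of every column, which is how a figure as large as 4010 arises.
total_movies_2008_90to120min = (filter_more90min & filter_less120min & filter_2008).sum()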

print("Total movies released in 2008 and 90 to 120 min: ", total_movies_2008_90to120min)

Total movies released in 2008 and 90 to 120 min:  4010
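
A count of 4010 films for a single year is hard to square with a data set of 5043 rows. `df.where(mask).count()` returns the number of non-null cells per column, and `.sum()` then adds those counts across all columns, whereas `mask.sum()` counts rows, which is what the question asks for. The difference can be checked on a small made-up frame; the `toy` data and its values are hypothetical and not taken from `movie_metadata.csv`:

```python
import pandas as pd

# Hypothetical toy data: two rows match the filter, one of them has a missing 'gross'
toy = pd.DataFrame({
    "duration": [95.0, 110.0, 130.0, 100.0],
    "title_year": [2008.0, 2008.0, 2008.0, 2010.0],
    "gross": [1.0, None, 3.0, 4.0],
})

mask = (toy["duration"] > 90) & (toy["duration"] < 120) & (toy["title_year"] == 2008)

# Non-null cells in the matching rows, summed over all columns
cell_count = toy.where(mask).count().sum()

# Number of matching rows (True counts as 1)
row_count = int(mask.sum())

print(cell_count, row_count)  # 5 2
```

`len(df[mask])` gives the same row count and may read more naturally to a beginner.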


**1.6.** The `budget` column contains the film's budget. What is the median budget for all the films listed? Give the answer as an integer.

In [55]:
# Verify budget column
df['budget'].isnull().sum()

0

In [56]:
df['budget'].describe()

count     5043
unique     440
top       0.0$
freq       492
Name: budget, dtype: object

In [57]:
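# Strip the literal '$' suffix; regex=False treats '$' as plain text, not as an end-of-string anchor.
# Convert to numbers so that .median() works; errors='coerce' turns unparsable values into NaN.
df['budget'] = pd.to_numeric(df['budget'].str.replace('$', '', regex=False), errors='coerce')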



In [61]:
median_budget = df['budget'].median()

print("Median budget: $", int(median_budget))

Median budget: $ 15000000
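
The `budget` column has `dtype: object`, so its values are strings and have to become numbers before a median can be taken. A minimal sketch of that cleaning step on three values copied from the `df.head(5)` output above; the Series name `budget` and the use of `pd.to_numeric` are assumptions for illustration:

```python
import pandas as pd

# Three budget strings as they appear in the head() output
budget = pd.Series(["237000000.0$", "300000000.0$", "0.0$"])

# regex=False removes the literal '$'; as a regex, '$' would only match the end of the string
cleaned = budget.str.replace("$", "", regex=False)

# Convert the cleaned strings to floats
numeric = pd.to_numeric(cleaned)

print(numeric.dtype)          # float64
print(int(numeric.median()))  # 237000000
```

Note that `0.0$` is the most frequent value in the full column (492 rows), so it may stand for an unknown budget; whether to keep those rows in the median is a judgment call the task does not settle.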


### Task 2. Answering student questions
How would you answer the student's question below? Your task is to get your message across in such a way that a beginner can understand your explanation. You can do this any way you want (pictures, GIFs, metaphors, anything) so long as it makes your explanation clear. Indicate how much time you spent completing this task.  
  
**What is the difference between DataFrame and Series?**
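
One possible way to show the difference to a beginner: a `Series` is a single labelled column of values, while a `DataFrame` is a table made of several such columns that share the same row labels. The sketch below uses a made-up `films` table; its titles and values are hypothetical.

```python
import pandas as pd

# A DataFrame: a table with rows and named columns
films = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "duration": [178, 169, 148],
    "imdb_score": [7.9, 7.1, 6.8],
})

# Single brackets pick one column and return a Series (one dimension)
durations = films["duration"]

# Double brackets return a DataFrame, even with only one column (two dimensions)
subset = films[["duration"]]

# One row is also a Series; its labels are the column names
first_row = films.loc[0]

print(type(durations).__name__, durations.shape)  # Series (3,)
print(type(subset).__name__, subset.shape)        # DataFrame (3, 1)
print(list(first_row.index))                      # ['title', 'duration', 'imdb_score']
```

A spreadsheet metaphor fits well here: the whole sheet is the `DataFrame`, and any single column or single row taken out of it is a `Series`.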

### Task 3. Task on probability theory

You are given two random variables *X* and *Y*.

```
E(X) = 0.5, Var(X) = 2

E(Y) = 7, Var(Y) = 3.5

cov (X, Y) = -0.8
```

Find the *variance* of the random variable **Z = 2X - 3Y**
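
A sketch of one way to work this out, using the standard identity Var(aX + bY) = a²·Var(X) + b²·Var(Y) + 2ab·cov(X, Y) with a = 2 and b = -3. The expected values are not needed for the variance. Variable names are illustrative.

```python
# Given quantities from the task statement
var_x = 2.0
var_y = 3.5
cov_xy = -0.8

# Coefficients of Z = 2X - 3Y
a, b = 2, -3

# Var(aX + bY) = a^2 Var(X) + b^2 Var(Y) + 2ab cov(X, Y)
var_z = a**2 * var_x + b**2 * var_y + 2 * a * b * cov_xy

print(round(var_z, 2))  # 49.1
```

Term by term this is 8 + 31.5 + 9.6: the negative covariance and the negative coefficient combine to increase the variance.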

### Task 4. Task for Data Science course

Omer trained a linear regression model and tested its performance on a test sample of 500 objects. On 400 of those, the model returned a prediction higher than expected by 0.5, and on the remaining 100, the model returned a prediction lower than expected by 0.7.

What is the MSE for his model?

Limor claims that the linear regression model wasn't trained correctly and that we can improve it by shifting all its predictions by a constant value. What will her MSE be?

You can assume that Limor found the smallest error under her constraints.

**Return two values - Omer's and Limor's MSE.**
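
A sketch of the arithmetic, assuming "expected" means the true target value, so that each error is prediction minus target. Shifting every prediction by a constant c changes every error by c, and the mean squared error is smallest when the shift cancels the mean error, which leaves the variance of the errors.

```python
# Errors (prediction - target): +0.5 on 400 objects, -0.7 on 100 objects
errors = [0.5] * 400 + [-0.7] * 100
n = len(errors)

# Omer's MSE: mean of squared errors
mse_omer = sum(e**2 for e in errors) / n

# The best constant shift subtracts the mean error from every prediction
mean_error = sum(errors) / n

# Limor's MSE: mean squared error after the shift (the variance of the errors)
mse_limor = sum((e - mean_error) ** 2 for e in errors) / n

print(round(mse_omer, 4), round(mean_error, 4), round(mse_limor, 4))  # 0.298 0.26 0.2304
```

Equivalently, Limor's MSE equals Omer's MSE minus the squared mean error: 0.298 - 0.26² = 0.2304.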

### Task 5. Please make a short video
Record a video of under 5 minutes in which you tell us more about yourself, your experience in DS/DA, and the reasons you're interested in becoming a tutor with Practicum. If you're allowed to, tell us about the most interesting projects you have worked on.