# Assignment \#2 - the Facebook dataset

Exploratory Data Analysis giving insights from the Facebook dataset.

Source: http://www.kaggle.com

**Features description**
- *userid*: anonymized user id
- *age*: age in years
- *dob_day*: date of birth (day)
- *dob_year*: date of birth (year)
- *dob_month*: date of birth (month)
- *gender*: gender
- *tenure*: number of days on Facebook
- *friend_count*: number of friends
- *friendships_initiated*: number of friendships initiated by the user
- *likes*: number of likes performed
- *likes_received*: number of likes received
- *mobile_likes*: number of mobile likes performed
- *mobile_likes_received*: number of mobile likes received
- *www_likes*: number of web likes performed
- *www_likes_received*: number of web likes received

#### Checklist before submitting your assignment:
- Check each function individually by invoking it.
- Check that every function returning a floating point number rounds it to 1 decimal place.
- Clean up your notebook before exporting it to Python format: remove unnecessary cells where you performed tests...
- Check that the Python file does not contain any syntax error:
    * Open a command window and type: python assignment_2.py
    * Or copy and paste the whole content of your Python file into a cell of your notebook and run it
    * No error should be output!
- Upload your file to the Google Drive folder and check that the file name is exactly `assignment_2.py`, with no extra characters.
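
The syntax check in the list above can also be done from Python itself. The following is a minimal sketch, not part of the assignment: the helper name `has_valid_syntax` and the two sample strings are hypothetical, and the check only parses the code without running it.

```python
import ast


def has_valid_syntax(source: str) -> bool:
    """Return True if `source` parses as Python code, False otherwise."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


# Hypothetical file contents, for illustration only
good_source = "def exercise_00():\n    return 1\n"
bad_source = "def exercise_00(:\n    return 1\n"

print(has_valid_syntax(good_source))
print(has_valid_syntax(bad_source))
```

To check the exported file, its text could be read first, for example with `open('assignment_2.py').read()`, and passed to the helper.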

**Questions**

Load the dataset and answer each question by implementing a function with no arguments.

The dataset should be downloaded, saved, and unzipped in the same folder as your notebook, NOT in a separate folder.

The different functions should use one of the variables defined below (`df`, `df_female`, `df_male`) and perform the appropriate computation.

Many questions deal with differences between females and males in the dataset.
Therefore, we have computed and prepared 3 datasets which are to be used:
- `df`: whole dataset,
- `df_female`: dataset with females only,
- `df_male`: dataset with males only.
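
The row counts reported further down (99003 rows in total, 40254 females, 58574 males) do not add up exactly, which suggests that some rows carry no gender value. That reading is an assumption, not something the assignment states. The sketch below uses toy data, not the real dataset, to show how a row with a missing gender ends up in neither subset.

```python
import numpy as np
import pandas as pd

# Toy data (not the real dataset): one row has no gender recorded
toy = pd.DataFrame({
    "userid": [1, 2, 3, 4],
    "gender": ["female", "male", np.nan, "male"],
})

# Same filtering pattern as the one used for df_female and df_male
toy_female = toy.loc[toy["gender"] == "female"]
toy_male = toy.loc[toy["gender"] == "male"]

# A missing value compares unequal to both labels, so it matches neither filter
n_missing = toy["gender"].isna().sum()

print(len(toy), len(toy_female), len(toy_male), n_missing)
```

On the real data, `df['gender'].isna().sum()` would be one way to check whether the gap is explained by missing values.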

**Caution**: Questions asking to return a floating point number (mean, percentage) should round it to 1 decimal place:
- Such questions are marked with `(°)`.
- For instance, if the variable `result` is a floating point number, e.g. `3.14159265359`
- The function should return `round(result, 1)` instead of `result`, e.g. `3.1`
- Percentages should be returned as floating point numbers and not with the `%` mark.
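
One detail about the rounding rule is worth illustrating: rounding intermediate values before combining them can change the last digit. The sketch below uses hypothetical numbers, not values from the dataset, to compare rounding early with rounding once on the final result.

```python
# Hypothetical values, for illustration only
observed_pct = 9.86        # e.g. a percentage computed from the data
expected_pct = 100 / 12    # 8.333...

# Rounding each intermediate value first, then rounding the difference
rounded_early = round(round(observed_pct, 1) - round(expected_pct, 1), 1)

# Rounding once, on the final result only
rounded_once = round(observed_pct - expected_pct, 1)

print(rounded_early, rounded_once)
```

Here the two approaches disagree by 0.1, so applying `round(result, 1)` only to the value being returned is the safer habit.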

**Tip**: All questions can be answered without implementing any loop or if/else statement.
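
As an illustration of this tip, a comparison on a column produces a boolean Series, and its mean is the share of `True` values. The sketch below runs on toy data, not the real dataset, and is only one possible way to avoid loops.

```python
import pandas as pd

# Toy data (not the real dataset)
toy = pd.DataFrame({"likes": [0, 3, 0, 12, 5]})

# True counts as 1 and False as 0, so the mean is the fraction of zeros
pct_no_likes = (toy["likes"] == 0).mean() * 100

print(round(pct_no_likes, 1))
```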

**General questions**

0) How many rows does the whole DataFrame have?

1) How many columns does the whole DataFrame have?

**Females and males**

2) How many females are there in the dataset?

3) How many males are there in the dataset?

**Total number of likes performed**

4) What is the total number of likes performed by females?

5) What is the total number of likes performed by males?

**Total number of likes received**

6) What is the total number of likes received by females?

7) What is the total number of likes received by males?

**Mean age (°)**

8) What is the mean age of females (°)?

9) What is the mean age of males (°)?

**Mean friend count (°)**

10) What is the mean friend count of females (°)?

11) What is the mean friend count of males (°)?

**Mean number of likes performed (°)**

12) What is the mean number of likes performed by females (°)?

13) What is the mean number of likes performed by males (°)?

**Mean number of likes received (°)**

14) What is the mean number of likes received by females (°)?

15) What is the mean number of likes received by males (°)?

**Percentage who did not perform any like (°)**

16) What is the percentage of females who did not perform any like (°)?

17) What is the percentage of males who did not perform any like (°)?

**Percentage who did not receive any like (°)**

18) What is the percentage of females who did not receive any like (°)?

19) What is the percentage of males who did not receive any like (°)?

**Percentage of mobile likes performed compared to all likes performed (°)**

20) What is the percentage of mobile likes performed by females compared to all likes performed by females (°)?

21) What is the percentage of mobile likes performed by males compared to all likes performed by males (°)?

**Percentage of mobile likes received compared to all likes received (°)**

22) What is the percentage of mobile likes received by females compared to all likes received by females (°)?

23) What is the percentage of mobile likes received by males compared to all likes received by males (°)?
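
These four questions ask for a ratio of totals. The sketch below uses toy data, not the real dataset, and assumes that `likes` is the sum of `mobile_likes` and `www_likes`, which the feature description does not state explicitly. It contrasts the ratio of totals with the mean of per-user ratios, which answers a different question.

```python
import pandas as pd

# Toy data (not the real dataset)
toy = pd.DataFrame({
    "mobile_likes": [0, 2, 30],
    "www_likes": [0, 2, 10],
})
# Assumption for this sketch: total likes = mobile likes + web likes
toy["likes"] = toy["mobile_likes"] + toy["www_likes"]

# Share of mobile likes among all likes: ratio of the two totals
pct_mobile = toy["mobile_likes"].sum() / toy["likes"].sum() * 100

# Mean of per-user ratios: users with zero likes give NaN (0/0),
# which mean() skips by default
per_user_mean = (toy["mobile_likes"] / toy["likes"]).mean() * 100

print(round(pct_mobile, 1), round(per_user_mean, 1))
```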

**Most frequent age**

24) What is the most frequent age of all users?
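
Two common pandas idioms for the most frequent value are sketched below on toy data, not the real dataset. Either could serve here; the choice between them is a matter of taste.

```python
import pandas as pd

# Toy data (not the real dataset)
toy_age = pd.Series([18, 21, 18, 35, 21, 18], name="age")

# mode() returns a Series of the most frequent value(s); take the first
most_frequent = toy_age.mode()[0]

# value_counts() counts each value; idxmax() gives the label with the top count
same_answer = toy_age.value_counts().idxmax()

print(most_frequent, same_answer)
```

If several ages were tied, `mode()` would return all of them in ascending order, so taking `[0]` would pick the smallest.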

**Friendship vs friendships initiated (°)**

25) What is the mean of the difference between friend count and friendships initiated for females (°)?

26) What is the mean of the difference between friend count and friendships initiated for males (°)?

**Months and days of the given date of birth (°)**

The last 4 questions deal with the months and days used by Facebook users for their date of birth.

The idea is that some people do not give their actual date of birth. They often switch the actual month to January and the actual day to the first day of a month.

If births were spread evenly over the year, the percentage of people born in January would be *circa* $100 \times \frac{1}{12}$, since there are 12 months in a year.

We are going to compute the difference between the percentage found in the dataset and this theoretical percentage.
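
The computation can be sketched on toy data, not the real dataset. The variable names are hypothetical, and the result is rounded only once, at the end.

```python
import pandas as pd

# Toy data (not the real dataset): 3 of 8 users declare January
toy = pd.DataFrame({"dob_month": [1, 1, 1, 4, 7, 9, 10, 12]})

observed_pct = (toy["dob_month"] == 1).mean() * 100   # 37.5
expected_pct = 100 / 12                               # 8.333...

# Positive value: January is over-represented compared to a uniform spread
excess = round(observed_pct - expected_pct, 1)

print(excess)
```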

27) What is the difference between the percentage of females born in January and 100/12 (°)?

28) What is the difference between the percentage of males born in January and 100/12 (°)?

Likewise, the percentage of people born on the first day of a month would be *circa* $100 \times \frac{12}{365}$, since 12 of the 365 days in a year are the first day of a month (ignoring leap years).

We are going to compute the difference between the percentage found in the dataset and this theoretical percentage.
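
The baseline $100 \times \frac{12}{365}$ can be checked by enumerating the days of a non-leap year. The sketch below picks 2013 arbitrarily as an example of such a year.

```python
import pandas as pd

# Every calendar day of an arbitrary non-leap year (2013 is an assumption)
days = pd.date_range("2013-01-01", "2013-12-31")

# Number of days that are the first day of a month
n_first_days = (days.day == 1).sum()

# Share of such days, as a percentage
baseline_pct = n_first_days / len(days) * 100

print(len(days), n_first_days, round(baseline_pct, 1))
```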

29) What is the difference between the percentage of females born on the first day of a month and 1200/365 (°)?

30) What is the difference between the percentage of males born on the first day of a month and 1200/365 (°)?

(°) Result of functions should be rounded to 1 decimal place.

**Now you have a better idea of some differences between female and male Facebook users!**

In [27]:
# usual import and options
import pandas as pd
pd.set_option("display.max_rows", 16)
pd.set_option("display.max_columns", 30)

In [28]:
# DO NOT MODIFY THIS CELL
# MAKE SURE THAT THE DATASET FILE 'pseudo_facebook.csv' IS IN THE SAME FOLDER AS THIS NOTEBOOK
# USE THE 3 VARIABLES: df, df_female, df_male IN YOUR FUNCTIONS
# ALL FUNCTIONS SHOULD HAVE NO ARGUMENT

# load the dataset and build subsets with females and males
df = pd.read_csv('pseudo_facebook.csv')
df_female = df.loc[df['gender'] == 'female']
df_male = df.loc[df['gender'] == 'male']
df.head()

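| | userid | age | dob_day | dob_year | dob_month | gender | tenure | friend_count | friendships_initiated | likes | likes_received | mobile_likes | mobile_likes_received | www_likes | www_likes_received |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2094382 | 14 | 19 | 1999 | 11 | male | 266.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1192601 | 14 | 2 | 1999 | 11 | female | 6.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2083884 | 14 | 16 | 1999 | 11 | male | 13.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1203168 | 14 | 25 | 1999 | 12 | female | 93.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1733186 | 14 | 4 | 1999 | 12 | male | 82.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |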


In [29]:
# THIS IS AN EXAMPLE. THERE IS NOTHING TO DO. IT WILL NOT BE GRADED.
# ALL FUNCTIONS SHOULD FOLLOW THE SAME PATTERN:
# - DEFINITION OF THE FUNCTION: def exercise_XX():
# - COMPUTATION OF THE RESULT: result = ...
# - RETURN OF THE RESULT: return result

# 0) How many rows does the DataFrame have?
def exercise_00():
    result = len(df)
    return result

# run and check
exercise_00()

99003

In [44]:
# 1) How many columns does the DataFrame have?
def exercise_01():
    x = len(df.columns)
    return x

# run and check
exercise_01()

15

In [45]:
# 2) How many females are there in the dataset?
def exercise_02():
    x = len(df_female)
    return x

# run and check
exercise_02()

40254

In [46]:
# 3) How many males are there in the dataset?
def exercise_03():
    x = len(df_male)
    return x

# run and check
exercise_03()

58574

In [47]:
# 4) What is the total number of likes performed by females?
def exercise_04():
    x = df_female['likes'].sum()
    return x

# run and check
exercise_04()

10468106

In [48]:
# 5) What is the total number of likes performed by males?
def exercise_05():
    x = df_male['likes'].sum()
    return x

# run and check
exercise_05()

4959923

In [117]:
# 6) What is the total number of likes received by females?
def exercise_06():
    x = df_female['likes_received'].sum()
    return x

# run and check
exercise_06()

10121282

In [50]:
# 7) What is the total number of likes received by males?
def exercise_07():
    x = df_male['likes_received'].sum()
    return x

# run and check
exercise_07()

3977851

In [38]:
# 8) What is the mean age of females (°)?
def exercise_08():
    x = round(df_female['age'].mean(),1)
    return x

# run and check
exercise_08()

39.5

In [39]:
# 9) What is the mean age of males (°)?
def exercise_09():
    x = round(df_male['age'].mean(),1)
    return x

# run and check
exercise_09()

35.7

In [40]:
# 10) What is the mean friend count of females (°)?
def exercise_10():
    x = round(df_female['friend_count'].mean(),1)
    return x

# run and check
exercise_10()

242.0

In [41]:
# 11) What is the mean friend count of males (°)?
def exercise_11():
    x = round(df_male['friend_count'].mean(),1)
    return x

# run and check
exercise_11()

165.0

In [42]:
# 12) What is the mean number of likes performed by females (°)?
def exercise_12():
    x = round(df_female['likes'].mean(),1)
    return x

# run and check
exercise_12()

260.1

In [43]:
# 13) What is the mean number of likes performed by males (°)?
def exercise_13():
    x = round(df_male['likes'].mean(),1)
    return x

# run and check
exercise_13()

84.7

In [51]:
# 14) What is the mean number of likes received by females (°)?
def exercise_14():
    x = round(df_female['likes_received'].mean(),1)
    return x

# run and check
exercise_14()

251.4

In [52]:
# 15) What is the mean number of likes received by males (°)?
def exercise_15():
    x = round(df_male['likes_received'].mean(),1)
    return x

# run and check
exercise_15()

67.9

In [112]:
# 16) What percentage of females did not perform any like (°)?
def exercise_16():
    x = (len(df_female.loc[df_female['likes']==0])/len(df_female))*100
    return round(x,1)

# run and check
exercise_16()

13.9

In [113]:
# 17) What percentage of males did not perform any like (°)?
def exercise_17():
    x = (len(df_male.loc[df_male['likes']==0])/len(df_male))*100
    return round(x,1)

# run and check
exercise_17()

28.5

In [114]:
# 18) What percentage of females did not receive any like (°)?
def exercise_18():
    x = (len(df_female.loc[df_female['likes_received']==0])/len(df_female))*100
    return round(x,1)

# run and check
exercise_18()

15.5

In [115]:
# 19) What percentage of males did not receive any like (°)?
def exercise_19():
    x = (len(df_male.loc[df_male['likes_received']==0])/len(df_male))*100
    return round(x,1)

# run and check
exercise_19()

31.0

In [116]:
# 20) What is the percentage of mobile likes performed by females compared to all likes performed by females (°)?
def exercise_20():
    x = (df_female['mobile_likes'].sum()/df_female['likes'].sum())*100
    return round(x,1)

# run and check
exercise_20()

66.5

In [58]:
# 21) What is the percentage of mobile likes performed by males compared to all likes performed by males (°)?
def exercise_21():
    x = df_male['mobile_likes'].sum()
    y = df_male['mobile_likes'].sum() + df_male['www_likes'].sum()
    z = round((x/y)*100,1)
    
    return z
    
# run and check
exercise_21()

71.2

In [118]:
# 22) What is the percentage of mobile likes received by females compared to all likes received by females (°)?
def exercise_22():
    x = df_female['mobile_likes_received'].sum()
    y = df_female['mobile_likes_received'].sum() + df_female['www_likes_received'].sum()
    z = round((x/y)*100,1)
    return z
    
# run and check
exercise_22()

58.5

In [65]:
# 23) What is the percentage of mobile likes received by males compared to all likes received by males (°)?
def exercise_23():
    x = df_male['mobile_likes_received'].sum()
    y = df_male['mobile_likes_received'].sum() + df_male['www_likes_received'].sum()
    z = round((x/y)*100,1)
    return z

# run and check
exercise_23()

60.1

In [66]:
# 24) What is the most frequent age of users?
def exercise_24():
    x = df['age'].value_counts().index[0]
    return x

# run and check
exercise_24()

18

In [104]:
# 25) What is the mean of the difference between friend count and friendships initiated for females (°)?
def exercise_25():
    x = df_female['friend_count'] - df_female['friendships_initiated']
    y = round(x.mean(),1)
    return y

# run and check
exercise_25()

128.1

In [106]:
# 26) What is the mean of the difference between friend count and friendships initiated for males (°)?
def exercise_26():
    x = df_male['friend_count'] - df_male['friendships_initiated']
    y = round(x.mean(),1)
    return y

# run and check
exercise_26()

62.0

In [87]:
# 27) What is the difference between the percentage of females born in January and 100/12 (°)?
def exercise_27():
    w = (df_female['dob_month']==1).sum()
    x = len(df_female['dob_month'])
    y = round((w/x)*100,1)
    z = round(100/12,1)
    
    difference = round(abs(y-z),1)
    
    return difference


# run and check
exercise_27()

1.6

In [94]:
# 28) What is the difference between the percentage of males born in January and 100/12 (°)?
def exercise_28():
    w = (df_male['dob_month']==1).sum()
    x = len(df_male['dob_month'])
    y = round((w/x)*100,1)
    z = round(100/12,1)
    
    difference = round(abs(y-z),1)
    
    return difference
    
 
# run and check
exercise_28()

4.9

In [96]:
# 29) What is the difference between the percentage of females born the first of a month and 1200/365 (°)?
def exercise_29():
    w = (df_female['dob_day']==1).sum()
    x = len(df_female['dob_day'])
    y = round((w/x)*100,1)
    z = round(1200/365,1)
    
    difference = round(abs(y-z),1)
    
    return difference

# run and check
exercise_29()

2.7

In [97]:
# 30) What is the difference between the percentage of males born the first of a month and 1200/365 (°)?
def exercise_30():
    w = (df_male['dob_day']==1).sum()
    x = len(df_male['dob_day'])
    y = round((w/x)*100,1)
    z = round(1200/365,1)
    
    difference = round(abs(y-z),1)
    
    return difference

# run and check
exercise_30()

6.0