# Occupation

**Dataset**: https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user

## Getting and Knowing the Data

**Import necessary libraries**

In [1]:
import numpy as np
import pandas as pd

**Import the dataset from the URL, assign it to a variable, and use 'user_id' as the index**

In [2]:
url = "https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user"

user_df = pd.read_csv(url, sep='|', index_col='user_id')
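
The same call can be tried offline on a tiny in-memory sample. The sketch below is illustrative only: the three rows are a hypothetical sample in the same pipe-delimited layout, and the `dtype` argument is an addition that assumes zip codes should stay as strings so that leading zeros are preserved.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same pipe-delimited layout as u.user
sample = io.StringIO(
    "user_id|age|gender|occupation|zip_code\n"
    "8|36|M|administrator|05201\n"
    "9|29|M|student|01002\n"
    "10|53|M|lawyer|90703\n"
)

# dtype={'zip_code': str} keeps leading zeros that integer parsing would drop
sample_df = pd.read_csv(sample, sep='|', index_col='user_id',
                        dtype={'zip_code': str})
print(sample_df)
```

Without the `dtype` hint, a column made up only of digits would be parsed as integers, and `05201` would be read as `5201`.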

**See the first 25 entries**

In [3]:
user_df.head(25)

user_id,age,gender,occupation,zip_code
1,24,M,technician,85711
2,53,F,other,94043
3,23,M,writer,32067
4,24,M,technician,43537
5,33,F,other,15213
6,42,M,executive,98101
7,57,M,administrator,91344
8,36,M,administrator,5201
9,29,M,student,1002
10,53,M,lawyer,90703


**See the last 10 entries**

In [4]:
user_df.tail(10)

user_id,age,gender,occupation,zip_code
934,61,M,engineer,22902
935,42,M,doctor,66221
936,24,M,other,32789
937,48,M,educator,98072
938,38,F,technician,55038
939,26,F,student,33319
940,32,M,administrator,2215
941,20,M,student,97229
942,48,F,librarian,78209
943,22,M,student,77841


**Check the number of observations in the dataset**

In [5]:
user_df.shape[0]

943

**Check the number of columns in the dataset**

In [6]:
user_df.shape[1]

4

**Print the names of all the columns**

In [7]:
user_df.columns

Index(['age', 'gender', 'occupation', 'zip_code'], dtype='object')

**Print the index**

In [8]:
user_df.index

Int64Index([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,
            ...
            934, 935, 936, 937, 938, 939, 940, 941, 942, 943],
           dtype='int64', name='user_id', length=943)

**Check the data type of each column**

In [9]:
# user_df.info()

# or
user_df.dtypes

age            int64
gender        object
occupation    object
zip_code      object
dtype: object

**Print only the occupation column**

In [10]:
# user_df['occupation']

# or
user_df.occupation

user_id
1         technician
2              other
3             writer
4         technician
5              other
           ...      
939          student
940    administrator
941          student
942        librarian
943          student
Name: occupation, Length: 943, dtype: object

**Check the number of unique occupations in the dataset**

In [11]:
number_occupations = user_df['occupation'].nunique()
# or
# number_occupations = user_df['occupation'].value_counts().count()

print('The number of unique occupations is {}'.format(number_occupations))

The number of unique occupations is 21
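
Both approaches in the cell above ignore missing values by default. The following sketch, on a hypothetical Series that is not taken from the dataset, shows how the count changes when missing values are included.

```python
import pandas as pd

# Hypothetical occupations with one missing value (not from the dataset)
occupations = pd.Series(['student', 'other', 'student', None])

n_unique = occupations.nunique()                       # missing values are skipped
n_from_counts = occupations.value_counts().count()     # same result, longer route
n_with_missing = occupations.nunique(dropna=False)     # counts the missing value too

print(n_unique, n_from_counts, n_with_missing)
```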


**Check the most frequent occupation**

In [12]:
user_df.occupation.value_counts().head(1).index[0]

'student'
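
An alternative worth knowing is `idxmax()`, which returns the label with the highest count directly. This is a minimal sketch on hypothetical data, not the notebook's own method.

```python
import pandas as pd

# Hypothetical occupations, not taken from the dataset
occupations = pd.Series(['student', 'other', 'student', 'writer', 'student', 'other'])

counts = occupations.value_counts()
most_frequent = counts.idxmax()  # label of the largest count
print(most_frequent)
```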

**Summarize the dataset**

In [13]:
user_df.describe()

,age
count,943.0
mean,34.051962
std,12.19274
min,7.0
25%,25.0
50%,31.0
75%,43.0
max,73.0


**Summarize all the columns**

In [14]:
# NOTE: By default, describe() returns only the numeric columns; include='all' adds the others.
user_df.describe(include='all')

,age,gender,occupation,zip_code
count,943.0,943,943,943
unique,,2,21,795
top,,M,student,55414
freq,,670,196,9
mean,34.051962,,,
std,12.19274,,,
min,7.0,,,
25%,25.0,,,
50%,31.0,,,
75%,43.0,,,


**Summarize only the occupation column**

In [15]:
user_df.occupation.describe()

count         943
unique         21
top       student
freq          196
Name: occupation, dtype: object

**Compute the average age of the users**

In [16]:
round(user_df.age.mean())

34

**Which age occurs least often?**

In [17]:
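# value_counts() sorts by frequency, so the last entry is a least frequent age.
# If several ages tie for the lowest count, this returns only one of them.
user_df.age.value_counts().tail(1).index[0]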

73
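
When more than one value shares the lowest count, taking the last row of `value_counts()` reports only one of them. A possible way to list every tied value is sketched below on hypothetical ages.

```python
import pandas as pd

# Hypothetical ages, not taken from the dataset
ages = pd.Series([24, 24, 33, 33, 33, 57, 73])

counts = ages.value_counts()
# Keep every age whose count equals the smallest count
least_frequent = sorted(counts[counts == counts.min()].index.tolist())
print(least_frequent)
```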

**What is the mean age per occupation?**

In [18]:
user_df.groupby('occupation').age.mean()

occupation
administrator    38.746835
artist           31.392857
doctor           43.571429
educator         42.010526
engineer         36.388060
entertainment    29.222222
executive        38.718750
healthcare       41.562500
homemaker        32.571429
lawyer           36.750000
librarian        40.000000
marketing        37.615385
none             26.555556
other            34.523810
programmer       33.121212
retired          63.071429
salesman         35.666667
scientist        35.548387
student          22.081633
technician       33.148148
writer           36.311111
Name: age, dtype: float64

**Compute the percentage of men per occupation and sort it from highest to lowest**

In [20]:
# Create a function
def gender_to_numeric(x):
    return 1 if x == 'M' else 0

# Apply the function to the gender column and create a new column
user_df['gender_n'] = user_df['gender'].apply(gender_to_numeric)

male_ratio = user_df.groupby('occupation').gender_n.sum() / user_df.occupation.value_counts() * 100

# Sort in descending order
male_ratio.sort_values(ascending=False)

doctor           100.000000
engineer          97.014925
technician        96.296296
retired           92.857143
programmer        90.909091
executive         90.625000
scientist         90.322581
entertainment     88.888889
lawyer            83.333333
salesman          75.000000
educator          72.631579
student           69.387755
other             65.714286
marketing         61.538462
writer            57.777778
none              55.555556
administrator     54.430380
artist            53.571429
librarian         43.137255
healthcare        31.250000
homemaker         14.285714
dtype: float64
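
The same percentage can be obtained without a helper column, because the mean of a boolean Series is the share of `True` values. The sketch below uses a hypothetical `people` frame and is an alternative to the cell above, not a replacement for it.

```python
import pandas as pd

# Hypothetical data, not taken from the dataset
people = pd.DataFrame({
    'occupation': ['doctor', 'doctor', 'artist', 'artist', 'artist', 'artist'],
    'gender':     ['M', 'M', 'F', 'M', 'F', 'F'],
})

# True where the user is male; the group mean is then the male share
male_ratio = (
    people['gender'].eq('M')
    .groupby(people['occupation'])
    .mean()
    .mul(100)
    .sort_values(ascending=False)
)
print(male_ratio)
```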

**For each occupation, calculate the minimum and maximum ages**

In [28]:
# user_df.groupby('occupation').age.min() 
# user_df.groupby('occupation').age.max()

user_df.groupby('occupation').age.agg(['min', 'max'])
# user_df.groupby('occupation').age.apply('min')

occupation,min,max
administrator,21,70
artist,19,48
doctor,28,64
educator,23,63
engineer,22,70
entertainment,15,50
executive,22,69
healthcare,22,62
homemaker,20,50
lawyer,21,53
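
Named aggregation is another way to write this when more descriptive column names are wanted. The sketch below assumes a hypothetical `people` frame; the names `min_age`, `max_age` and `span` are illustrative choices.

```python
import pandas as pd

# Hypothetical data, not taken from the dataset
people = pd.DataFrame({
    'occupation': ['artist', 'artist', 'lawyer', 'lawyer', 'lawyer'],
    'age':        [19, 48, 21, 53, 36],
})

# Each keyword becomes an output column; its value is the aggregation to run
age_range = people.groupby('occupation')['age'].agg(min_age='min', max_age='max')

# A derived column is easy to add once the columns have clear names
age_range['span'] = age_range['max_age'] - age_range['min_age']
print(age_range)
```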


**For each combination of occupation and gender, calculate the mean age**

In [31]:
user_df.groupby(['occupation', 'gender']).age.mean()

occupation     gender
administrator  F         40.638889
               M         37.162791
artist         F         30.307692
               M         32.333333
doctor         M         43.571429
educator       F         39.115385
               M         43.101449
engineer       F         29.500000
               M         36.600000
entertainment  F         31.000000
               M         29.000000
executive      F         44.000000
               M         38.172414
healthcare     F         39.818182
               M         45.400000
homemaker      F         34.166667
               M         23.000000
lawyer         F         39.500000
               M         36.200000
librarian      F         40.000000
               M         40.000000
marketing      F         37.200000
               M         37.875000
none           F         36.500000
               M         18.600000
other          F         35.472222
               M         34.028986
programmer     F         32.16666
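
A long two-level result like this can be easier to scan as a wide table with one column per gender. The sketch below uses `unstack` on hypothetical data; combinations that do not occur show up as `NaN`.

```python
import pandas as pd

# Hypothetical data, not taken from the dataset
people = pd.DataFrame({
    'occupation': ['artist', 'artist', 'artist', 'doctor', 'doctor'],
    'gender':     ['F', 'M', 'F', 'M', 'M'],
    'age':        [30, 32, 34, 40, 50],
})

# Move the 'gender' index level into columns: one row per occupation
mean_age = people.groupby(['occupation', 'gender'])['age'].mean().unstack('gender')
print(mean_age)
```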

**For each occupation, present the percentage of women and men**

In [40]:
# Create a DataFrame and apply count to gender
occupation_genderCount = user_df.groupby(['occupation', 'gender']).agg({'gender': 'count'})
# occupation_genderCount = user_df.groupby(['occupation', 'gender']).gender.apply('count')

# Create a DataFrame and apply count for each occupation
occupation_count = user_df.groupby('occupation').apply('count')
# occupation_count = user_df.groupby('occupation').agg('count')


# Divide occupation_genderCount by occupation_count and multiply by 100
occupation_gender = occupation_genderCount.div(occupation_count, level='occupation') * 100
occupation_gender.loc[:, 'gender']

occupation     gender
administrator  F          45.569620
               M          54.430380
artist         F          46.428571
               M          53.571429
doctor         M         100.000000
educator       F          27.368421
               M          72.631579
engineer       F           2.985075
               M          97.014925
entertainment  F          11.111111
               M          88.888889
executive      F           9.375000
               M          90.625000
healthcare     F          68.750000
               M          31.250000
homemaker      F          85.714286
               M          14.285714
lawyer         F          16.666667
               M          83.333333
librarian      F          56.862745
               M          43.137255
marketing      F          38.461538
               M          61.538462
none           F          44.444444
               M          55.555556
other          F          34.285714
               M          65.714286
progra
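
A shorter route to the same kind of result is `value_counts(normalize=True)` on the grouped column, which returns within-group shares directly. This is a sketch on hypothetical data, offered as an alternative to the division above.

```python
import pandas as pd

# Hypothetical data, not taken from the dataset
people = pd.DataFrame({
    'occupation': ['doctor', 'doctor', 'artist', 'artist', 'artist', 'artist'],
    'gender':     ['M', 'M', 'F', 'M', 'F', 'F'],
})

# normalize=True turns counts into fractions of each occupation group
gender_pct = (
    people.groupby('occupation')['gender']
    .value_counts(normalize=True)
    .mul(100)
)
print(gender_pct)
```

Combinations that never occur, such as a female doctor in this sample, are simply absent from the result rather than reported as 0.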