# Occupation

### Introduction:

Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

### Step 1. Import the necessary libraries

In [1]:
import pandas as pd

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user). 

The header and first few rows of the data set are as follows:

user_id|age|gender|occupation|zip_code

1|24|M|technician|85711

2|53|F|other|94043

3|23|M|writer|32067

4|24|M|technician|43537

5|33|F|other|15213

6|42|M|executive|98101

### Step 3. Assign it to a variable called users.

In [3]:
users = pd.read_table('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user', 
                      sep='|', index_col='user_id')
users.head()

         age gender  occupation zip_code
user_id                                 
1         24      M  technician    85711
2         53      F       other    94043
3         23      M      writer    32067
4         24      M  technician    43537
5         33      F       other    15213
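
The loading step can also be tried offline. The following is a minimal sketch that parses the six sample rows quoted in Step 2 from an in-memory string; the names `raw` and `sample_users` and the `dtype` argument are assumptions added for illustration, not part of the original exercise.

```python
from io import StringIO

import pandas as pd

# Header and first six rows quoted in Step 2, kept in memory
# so the example runs without network access.
raw = """user_id|age|gender|occupation|zip_code
1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
4|24|M|technician|43537
5|33|F|other|15213
6|42|M|executive|98101
"""

# sep='|' parses the pipe-delimited layout; index_col makes user_id the index.
# dtype={'zip_code': str} is an assumption added here to keep ZIP codes as text.
sample_users = pd.read_csv(StringIO(raw), sep='|', index_col='user_id',
                           dtype={'zip_code': str})
print(sample_users.head())
```

Reading ZIP codes as text is a design choice: they are labels rather than quantities, and a numeric dtype would drop any leading zeros.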


### Step 4. Find the mean age per occupation

In [4]:
users.groupby('occupation').age.mean()

occupation
administrator    38.746835
artist           31.392857
doctor           43.571429
educator         42.010526
engineer         36.388060
entertainment    29.222222
executive        38.718750
healthcare       41.562500
homemaker        32.571429
lawyer           36.750000
librarian        40.000000
marketing        37.615385
none             26.555556
other            34.523810
programmer       33.121212
retired          63.071429
salesman         35.666667
scientist        35.548387
student          22.081633
technician       33.148148
writer           36.311111
Name: age, dtype: float64
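
To see how the grouped mean is built, here is a small sketch on the six sample rows from Step 2, where the result can be checked by hand. The frame name `sample` is hypothetical, and the numbers apply only to this sample, not to the full dataset.

```python
import pandas as pd

# Ages and occupations of the six sample rows shown in Step 2.
sample = pd.DataFrame({
    'age': [24, 53, 23, 24, 33, 42],
    'occupation': ['technician', 'other', 'writer',
                   'technician', 'other', 'executive'],
})

# groupby splits the rows by occupation; mean() then averages age per group.
mean_age = sample.groupby('occupation')['age'].mean()
print(mean_age)
```

For example, the two "other" rows have ages 53 and 33, so their group mean is 43.0.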

### Step 5. Find the male ratio per occupation and sort it from highest to lowest

In [6]:
# create a function
def gender_to_numeric(x):
    if x == 'M':
        return 1
    if x == 'F':
        return 0

# apply the function to the gender column and create a new column
users['gender_n'] = users['gender'].apply(gender_to_numeric)

# Note: users['gender'] is a Series, so apply calls gender_to_numeric
# once per value in the column; any value other than 'M' or 'F'
# makes the function return None.


a = users.groupby('occupation').gender_n.sum() / users.occupation.value_counts() * 100 

# Note: `a` is a Series, so sort_values needs no `by` argument;
# ascending=False puts the largest values first, and NaNs go last by default.

# sort from the highest male percentage to the lowest
a.sort_values(ascending = False)

doctor           100.000000
engineer          97.014925
technician        96.296296
retired           92.857143
programmer        90.909091
executive         90.625000
scientist         90.322581
entertainment     88.888889
lawyer            83.333333
salesman          75.000000
educator          72.631579
student           69.387755
other             65.714286
marketing         61.538462
writer            57.777778
none              55.555556
administrator     54.430380
artist            53.571429
librarian         43.137255
healthcare        31.250000
homemaker         14.285714
dtype: float64
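
The same ratio can be computed without a helper column, since the mean of a boolean Series is the fraction of `True` values. This is a sketch of that alternative on the six sample rows from Step 2, not the approach used above; `sample` and `male_pct` are hypothetical names.

```python
import pandas as pd

# Gender and occupation of the six sample rows shown in Step 2.
sample = pd.DataFrame({
    'gender': ['M', 'F', 'M', 'M', 'F', 'M'],
    'occupation': ['technician', 'other', 'writer',
                   'technician', 'other', 'executive'],
})

# (gender == 'M') is True for men; its mean per occupation is the male share.
is_male = sample['gender'] == 'M'
male_pct = (is_male.groupby(sample['occupation']).mean() * 100)
male_pct = male_pct.sort_values(ascending=False)
print(male_pct)
```

In this small sample both "other" rows are women, so that occupation ends up last with 0.0.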

### Step 6. For each occupation, calculate the minimum and maximum ages

In [8]:
users.groupby('occupation').age.agg(['min', 'max'])

               min  max
occupation             
administrator   21   70
artist          19   48
doctor          28   64
educator        23   63
engineer        22   70
entertainment   15   50
executive       22   69
healthcare      22   62
homemaker       20   50
lawyer          21   53
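
If more descriptive column names are wanted, `agg` also accepts keyword arguments that name each output column. The sketch below shows this on the six sample rows from Step 2; the column names `youngest`, `oldest`, and `spread` are made up for illustration.

```python
import pandas as pd

# Ages and occupations of the six sample rows shown in Step 2.
sample = pd.DataFrame({
    'age': [24, 53, 23, 24, 33, 42],
    'occupation': ['technician', 'other', 'writer',
                   'technician', 'other', 'executive'],
})

# Named aggregation: each keyword becomes a column in the result.
age_range = sample.groupby('occupation')['age'].agg(youngest='min', oldest='max')

# A derived column is then easy to add, e.g. the age spread per occupation.
age_range['spread'] = age_range['oldest'] - age_range['youngest']
print(age_range)
```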


### Step 7. For each combination of occupation and gender, calculate the mean age

In [9]:
users.groupby(['occupation', 'gender']).age.mean()

occupation     gender
administrator  F         40.638889
               M         37.162791
artist         F         30.307692
               M         32.333333
doctor         M         43.571429
educator       F         39.115385
               M         43.101449
engineer       F         29.500000
               M         36.600000
entertainment  F         31.000000
               M         29.000000
executive      F         44.000000
               M         38.172414
healthcare     F         39.818182
               M         45.400000
homemaker      F         34.166667
               M         23.000000
lawyer         F         39.500000
               M         36.200000
librarian      F         40.000000
               M         40.000000
marketing      F         37.200000
               M         37.875000
none           F         36.500000
               M         18.600000
other          F         35.472222
               M         34.028986
programmer     F         32.16666
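
A two-level result like this can be easier to compare side by side after pivoting the `gender` level into columns with `unstack`. The following is a sketch on the six sample rows from Step 2 (hypothetical names `sample` and `by_gender`); combinations that do not occur, like the missing `F` entry for doctor above, show up as NaN.

```python
import pandas as pd

# The six sample rows shown in Step 2 (age, gender, occupation only).
sample = pd.DataFrame({
    'age': [24, 53, 23, 24, 33, 42],
    'gender': ['M', 'F', 'M', 'M', 'F', 'M'],
    'occupation': ['technician', 'other', 'writer',
                   'technician', 'other', 'executive'],
})

# Mean age per (occupation, gender), then move gender from the index to columns.
by_gender = sample.groupby(['occupation', 'gender'])['age'].mean().unstack('gender')
print(by_gender)
```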

### Step 8. For each occupation, present the percentage of women and men

In [11]:
# count the users in each (occupation, gender) group
gender_ocup = users.groupby(['occupation', 'gender']).agg({'gender': 'count'})
# print(gender_ocup)

# count the non-null values of every column for each occupation
occup_count = users.groupby(['occupation']).agg('count')
# print(occup_count)

# divide gender_ocup by occup_count and multiply by 100;
# level="occupation" matches occup_count's index against the 'occupation'
# level of gender_ocup's MultiIndex
occup_gender = gender_ocup.div(occup_count, level = "occupation") * 100
print(occup_gender)

# Note: only the 'gender' column exists in both frames, so the other columns
# (age, gender_n, zip_code) come out as NaN in the result

# show all rows of the 'gender' column
print(occup_gender.loc[: , 'gender'])

                      age      gender  gender_n  zip_code
occupation    gender                                     
administrator F       NaN   45.569620       NaN       NaN
              M       NaN   54.430380       NaN       NaN
artist        F       NaN   46.428571       NaN       NaN
              M       NaN   53.571429       NaN       NaN
doctor        M       NaN  100.000000       NaN       NaN
educator      F       NaN   27.368421       NaN       NaN
              M       NaN   72.631579       NaN       NaN
engineer      F       NaN    2.985075       NaN       NaN
              M       NaN   97.014925       NaN       NaN
entertainment F       NaN   11.111111       NaN       NaN
              M       NaN   88.888889       NaN       NaN
executive     F       NaN    9.375000       NaN       NaN
              M       NaN   90.625000       NaN       NaN
healthcare    F       NaN   68.750000       NaN       NaN
              M       NaN   31.250000       NaN       NaN
homemaker     
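
As an alternative sketch that avoids the NaN columns, `pd.crosstab` with `normalize='index'` computes row-wise shares directly. It is shown here on the six sample rows from Step 2 rather than the full dataset, and `sample` and `pct` are hypothetical names.

```python
import pandas as pd

# Gender and occupation of the six sample rows shown in Step 2.
sample = pd.DataFrame({
    'gender': ['M', 'F', 'M', 'M', 'F', 'M'],
    'occupation': ['technician', 'other', 'writer',
                   'technician', 'other', 'executive'],
})

# Count each (occupation, gender) pair and divide every row by its total;
# multiplying by 100 turns the shares into percentages.
pct = pd.crosstab(sample['occupation'], sample['gender'], normalize='index') * 100
print(pct)
```

Each row of `pct` sums to 100, and combinations that never occur appear as 0.0 instead of being absent.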