# Occupation

### Introduction:

Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

### Step 1. Import the necessary libraries

In [1]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
!tar xf spark-3.3.0-bin-hadoop3.tgz

!pip install -q findspark

In [4]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.3.0-bin-hadoop3"

import findspark
findspark.init()

import pandas as pd

from pyspark.sql import SparkSession

spark = SparkSession.builder\
        .master("local")\
        .appName("Colab")\
        .config('spark.ui.port', '4050')\
        .getOrCreate()

import pyspark.sql.functions as F

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user). 

In [5]:
url = "https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user"

### Step 3. Assign it to a variable called users.

In [34]:
from pyspark import SparkFiles
spark.sparkContext.addFile(url)
users = spark.read.csv(SparkFiles.get("u.user"), header=True, sep="|")

In [35]:
users.show(10)

+-------+---+------+-------------+--------+
|user_id|age|gender|   occupation|zip_code|
+-------+---+------+-------------+--------+
|      1| 24|     M|   technician|   85711|
|      2| 53|     F|        other|   94043|
|      3| 23|     M|       writer|   32067|
|      4| 24|     M|   technician|   43537|
|      5| 33|     F|        other|   15213|
|      6| 42|     M|    executive|   98101|
|      7| 57|     M|administrator|   91344|
|      8| 36|     M|administrator|   05201|
|      9| 29|     M|      student|   01002|
|     10| 53|     M|       lawyer|   90703|
+-------+---+------+-------------+--------+
only showing top 10 rows
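
Without `inferSchema` or an explicit schema, Spark's CSV reader returns every column as a string, which is why `age` is cast in Step 4. The same effect can be sketched without Spark using the standard `csv` module on a few rows copied from the output above. The `SAMPLE` text and variable names below are illustrative only and are not part of the original notebook.

```python
import csv
import io

# Three rows copied from the users.show(10) output, in the file's pipe-delimited layout.
SAMPLE = """user_id|age|gender|occupation|zip_code
1|24|M|technician|85711
8|36|M|administrator|05201
9|29|M|student|01002
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="|"))

# Every field arrives as text, so numeric columns need an explicit cast.
ages = [int(row["age"]) for row in rows]
mean_age = sum(ages) / len(ages)

# zip_code should stay a string: converting it to int would drop the leading zero.
zip_codes = [row["zip_code"] for row in rows]

print(type(rows[0]["age"]).__name__, mean_age, zip_codes)
```

Keeping `zip_code` as text preserves values such as `05201` and `01002` exactly as they appear in the file.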



### Step 4. Find the mean age per occupation

In [36]:
from pyspark.sql.types import IntegerType

In [37]:
users = users.withColumn("age", F.col("age").cast(IntegerType()))

In [38]:
users.groupby("occupation").avg("age").show()

+-------------+------------------+
|   occupation|          avg(age)|
+-------------+------------------+
|    librarian|              40.0|
|      retired| 63.07142857142857|
|       lawyer|             36.75|
|         none|26.555555555555557|
|       writer| 36.31111111111111|
|   programmer|33.121212121212125|
|    marketing| 37.61538461538461|
|        other|34.523809523809526|
|    executive|          38.71875|
|    scientist| 35.54838709677419|
|      student|22.081632653061224|
|     salesman|35.666666666666664|
|       artist|31.392857142857142|
|   technician|33.148148148148145|
|administrator| 38.74683544303797|
|     engineer| 36.38805970149254|
|   healthcare|           41.5625|
|     educator| 42.01052631578948|
|entertainment| 29.22222222222222|
|    homemaker| 32.57142857142857|
+-------------+------------------+
only showing top 20 rows



### Step 5. Compute the male ratio per occupation and sort it from highest to lowest

In [39]:
# Map gender to a 0/1 indicator so that its sum counts the men in each group.
def gender_to_num(g):
    return 1 if g == "M" else 0

udf_gender_to_num = F.udf(gender_to_num, IntegerType())


In [40]:
users = users.withColumn("gender_n", udf_gender_to_num("gender"))

In [44]:
users.groupby("occupation").agg(F.sum("gender_n")/F.count("gender")).show()

+-------------+-------------------------------+
|   occupation|(sum(gender_n) / count(gender))|
+-------------+-------------------------------+
|    librarian|            0.43137254901960786|
|      retired|             0.9285714285714286|
|       lawyer|             0.8333333333333334|
|         none|             0.5555555555555556|
|       writer|             0.5777777777777777|
|   programmer|             0.9090909090909091|
|    marketing|             0.6153846153846154|
|        other|             0.6571428571428571|
|    executive|                        0.90625|
|    scientist|             0.9032258064516129|
|      student|             0.6938775510204082|
|     salesman|                           0.75|
|       artist|             0.5357142857142857|
|   technician|             0.9629629629629629|
|administrator|             0.5443037974683544|
|     engineer|             0.9701492537313433|
|   healthcare|                         0.3125|
|     educator|             0.7263157894
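
The cell above computes the ratio but does not apply the sort the step asks for. In PySpark, one option would be to alias the aggregate (for example as `male_ratio`, a name chosen here for illustration) and chain `.orderBy(F.desc("male_ratio"))` before `.show()`. The arithmetic itself, the mean of a 0/1 indicator per group followed by a descending sort, can be sketched in plain Python on the ten rows shown in Step 3. The helper `male_ratio_sorted` is hypothetical, and with only ten rows the ratios are all 0 or 1.

```python
from collections import defaultdict

# (occupation, gender) pairs copied from the ten rows shown in Step 3.
ROWS = [
    ("technician", "M"), ("other", "F"), ("writer", "M"), ("technician", "M"),
    ("other", "F"), ("executive", "M"), ("administrator", "M"),
    ("administrator", "M"), ("student", "M"), ("lawyer", "M"),
]


def male_ratio_sorted(rows):
    """Return [(occupation, male_ratio)] sorted from highest to lowest ratio."""
    males = defaultdict(int)
    totals = defaultdict(int)
    for occupation, gender in rows:
        males[occupation] += 1 if gender == "M" else 0  # same 0/1 indicator as gender_n
        totals[occupation] += 1
    ratios = {occ: males[occ] / totals[occ] for occ in totals}
    # Sort by ratio descending; ties are broken alphabetically for a stable result.
    return sorted(ratios.items(), key=lambda item: (-item[1], item[0]))


result = male_ratio_sorted(ROWS)
print(result)
```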

### Step 6. For each occupation, calculate the minimum and maximum ages

In [47]:
users.groupby("occupation").agg(F.min("age"), F.max("age")).show()

+-------------+--------+--------+
|   occupation|min(age)|max(age)|
+-------------+--------+--------+
|    librarian|      23|      69|
|      retired|      51|      73|
|       lawyer|      21|      53|
|         none|      11|      55|
|       writer|      18|      60|
|   programmer|      20|      63|
|    marketing|      24|      55|
|        other|      13|      64|
|    executive|      22|      69|
|    scientist|      23|      55|
|      student|       7|      42|
|     salesman|      18|      66|
|       artist|      19|      48|
|   technician|      21|      55|
|administrator|      21|      70|
|     engineer|      22|      70|
|   healthcare|      22|      62|
|     educator|      23|      63|
|entertainment|      15|      50|
|    homemaker|      20|      50|
+-------------+--------+--------+
only showing top 20 rows



### Step 7. For each combination of occupation and gender, calculate the mean age

In [50]:
group = ["occupation", "gender"]
cols = ["age"]
funs = [F.avg]
exprs = [f(F.col(col)) for col in cols for f in funs]

In [55]:
users.groupby(*group).agg(*exprs).sort("occupation", ascending=True).show()

+-------------+------+------------------+
|   occupation|gender|          avg(age)|
+-------------+------+------------------+
|administrator|     M| 37.16279069767442|
|administrator|     F|40.638888888888886|
|       artist|     F|30.307692307692307|
|       artist|     M|32.333333333333336|
|       doctor|     M| 43.57142857142857|
|     educator|     M| 43.10144927536232|
|     educator|     F| 39.11538461538461|
|     engineer|     M|              36.6|
|     engineer|     F|              29.5|
|entertainment|     F|              31.0|
|entertainment|     M|              29.0|
|    executive|     M|38.172413793103445|
|    executive|     F|              44.0|
|   healthcare|     M|              45.4|
|   healthcare|     F| 39.81818181818182|
|    homemaker|     F|34.166666666666664|
|    homemaker|     M|              23.0|
|       lawyer|     F|              39.5|
|       lawyer|     M|              36.2|
|    librarian|     M|              40.0|
+-------------+------+------------------+
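
The `exprs` comprehension in this step generalises: each extra entry in `cols` or `funs` yields one more aggregate, with columns varying slowest and functions fastest. Setting `funs = [F.min, F.max, F.avg]`, for instance, should produce the Step 6 and Step 7 aggregates in a single `agg` call. The ordering can be checked in plain Python with built-in stand-ins for the Spark functions. The `table` dictionary is an illustrative container holding the ten ages shown in Step 3.

```python
from statistics import mean

cols = ["age"]
funs = [min, max, mean]  # plain-Python stand-ins for F.min, F.max, F.avg

# Ages copied from the ten rows shown in Step 3.
table = {"age": [24, 53, 23, 24, 33, 42, 57, 36, 29, 53]}

# Same shape as the notebook's comprehension: outer loop over columns,
# inner loop over functions, so all aggregates of one column stay together.
exprs = [(f.__name__, col, f(table[col])) for col in cols for f in funs]

for name, col, value in exprs:
    print(f"{name}({col}) = {value}")
```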

### Step 8. For each occupation, present the percentage of women and men
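
No solution cell follows this step in the source. One possible route in PySpark, assuming the `gender_n` indicator from Step 5, is to take `100 * F.avg("gender_n")` per occupation as the male percentage and its complement as the female percentage. The counting behind that is sketched below in plain Python on the ten rows shown in Step 3. The function `gender_percentages` is a hypothetical helper, not part of the original notebook.

```python
from collections import Counter

# (occupation, gender) pairs copied from the ten rows shown in Step 3.
ROWS = [
    ("technician", "M"), ("other", "F"), ("writer", "M"), ("technician", "M"),
    ("other", "F"), ("executive", "M"), ("administrator", "M"),
    ("administrator", "M"), ("student", "M"), ("lawyer", "M"),
]


def gender_percentages(rows):
    """Return {occupation: {"F": pct, "M": pct}}; each pair sums to 100."""
    pair_counts = Counter(rows)                             # count per (occupation, gender)
    totals = Counter(occupation for occupation, _ in rows)  # count per occupation
    return {
        occ: {g: 100 * pair_counts[(occ, g)] / total for g in ("F", "M")}
        for occ, total in sorted(totals.items())
    }


percentages = gender_percentages(ROWS)
for occ, pct in percentages.items():
    print(f"{occ:<13} F={pct['F']:5.1f}%  M={pct['M']:5.1f}%")
```

On the full dataset the percentages would take intermediate values. With only ten sample rows every occupation here is entirely one gender.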