# Occupation

### Introduction:

Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

### Step 1. Import the necessary libraries

In [6]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PySparkPrac").getOrCreate()

In [16]:
import requests
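# Import only the functions used below. Note that sum, min, and max here are
# Spark column functions and shadow Python's built-ins of the same name.
from pyspark.sql.functions import avg, col, count, max, min, sum, when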

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user). 

### Step 3. Assign it to a variable called users.

In [None]:
url = "https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user"
file_path = "/home/neosoft/Documents/Practice/Daily_Practice/21st_Aug/u.user"

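# Download the file and save it locally; stop on an HTTP error instead of
# silently writing an error page to disk.
response = requests.get(url)
response.raise_for_status()

with open(file_path, "wb") as f:
    f.write(response.content)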

In [20]:
users = spark.read.option("header",True).option("delimiter","|").csv(file_path)

In [21]:
users.show(5)

+-------+---+------+----------+--------+
|user_id|age|gender|occupation|zip_code|
+-------+---+------+----------+--------+
|      1| 24|     M|technician|   85711|
|      2| 53|     F|     other|   94043|
|      3| 23|     M|    writer|   32067|
|      4| 24|     M|technician|   43537|
|      5| 33|     F|     other|   15213|
+-------+---+------+----------+--------+
only showing top 5 rows
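
As a rough illustration of what the `header` and `delimiter` options do, the sketch below parses the first rows shown above with Python's standard `csv` module instead of Spark. The `sample` string and the `rows` and `ages` names are made up for this example. Every parsed value is a string until it is converted explicitly; `spark.read.csv` likewise reads every column as a string unless a schema is supplied or `inferSchema` is enabled.

```python
import csv
import io

# First rows of u.user as shown in the output above (pipe-delimited, with a header line).
sample = (
    "user_id|age|gender|occupation|zip_code\n"
    "1|24|M|technician|85711\n"
    "2|53|F|other|94043\n"
    "3|23|M|writer|32067\n"
)

# delimiter="|" splits on pipes; DictReader uses the first line as column names.
rows = list(csv.DictReader(io.StringIO(sample), delimiter="|"))

print(rows[0]["occupation"])   # technician
print(type(rows[0]["age"]))    # <class 'str'>

# Values stay strings until converted explicitly.
ages = [int(row["age"]) for row in rows]
print(ages)                    # [24, 53, 23]
```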



### Step 4. Discover the mean age per occupation

In [22]:
users.groupBy("occupation").agg(avg(users.age).alias("Mean_Age")).show()

+-------------+------------------+
|   occupation|          Mean_Age|
+-------------+------------------+
|    librarian|              40.0|
|      retired| 63.07142857142857|
|       lawyer|             36.75|
|         none|26.555555555555557|
|       writer| 36.31111111111111|
|   programmer|33.121212121212125|
|    marketing| 37.61538461538461|
|        other|34.523809523809526|
|    executive|          38.71875|
|    scientist| 35.54838709677419|
|      student|22.081632653061224|
|     salesman|35.666666666666664|
|       artist|31.392857142857142|
|   technician|33.148148148148145|
|administrator| 38.74683544303797|
|     engineer| 36.38805970149254|
|   healthcare|           41.5625|
|     educator| 42.01052631578948|
|entertainment| 29.22222222222222|
|    homemaker| 32.57142857142857|
+-------------+------------------+
only showing top 20 rows



### Step 5. Discover the male ratio per occupation and sort it from highest to lowest

In [23]:
(
    users.groupBy(users.occupation)
    .agg(
        # Count males by summing a 0/1 indicator, and count all rows per occupation.
        sum(when(col("gender") == "M", 1).otherwise(0)).alias("male_count"),
        count("*").alias("total_count"),
    )
    .withColumn("male_ratio", col("male_count") / col("total_count"))
    .orderBy(col("male_ratio").desc())
    .show()
)

+-------------+----------+-----------+-------------------+
|   occupation|male_count|total_count|         male_ratio|
+-------------+----------+-----------+-------------------+
|       doctor|         7|          7|                1.0|
|     engineer|        65|         67| 0.9701492537313433|
|   technician|        26|         27| 0.9629629629629629|
|      retired|        13|         14| 0.9285714285714286|
|   programmer|        60|         66| 0.9090909090909091|
|    executive|        29|         32|            0.90625|
|    scientist|        28|         31| 0.9032258064516129|
|entertainment|        16|         18| 0.8888888888888888|
|       lawyer|        10|         12| 0.8333333333333334|
|     salesman|         9|         12|               0.75|
|     educator|        69|         95| 0.7263157894736842|
|      student|       136|        196| 0.6938775510204082|
|        other|        69|        105| 0.6571428571428571|
|    marketing|        16|         26| 0.6153846153846154|
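
The male ratio is the mean of a 0/1 indicator, which is why summing `when(col("gender") == "M", 1).otherwise(0)` and dividing by the row count works. Here is a small plain-Python sketch of that idea, using the engineer counts from the table above (65 men out of 67). The `genders` list is reconstructed from those counts for illustration only and does not reflect the real row order.

```python
# Reconstruct a gender column for engineers from the counts in the table:
# 65 male and 2 female users (illustrative, not the real row order).
genders = ["M"] * 65 + ["F"] * 2

# 1 for male, 0 otherwise, mirroring when(col("gender") == "M", 1).otherwise(0).
indicator = [1 if g == "M" else 0 for g in genders]

male_count = sum(indicator)
total_count = len(genders)
male_ratio = male_count / total_count

print(male_count, total_count)   # 65 67
print(round(male_ratio, 4))      # 0.9701
```

In PySpark the same ratio could also be obtained by averaging the indicator directly, for example `avg(when(col("gender") == "M", 1).otherwise(0))`.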

### Step 6. For each occupation, calculate the minimum and maximum ages

In [18]:
# age is read from the CSV as a string, so cast it to int first;
# otherwise min/max compare the values lexicographically ("10" < "7").
users.groupBy(users.occupation).agg(min(users.age.cast("int")).alias('Min_age'), max(users.age.cast("int")).alias('Max_age')).show()

+-------------+-------+-------+
|   occupation|Min_age|Max_age|
+-------------+-------+-------+
|administrator|     21|     70|
|       artist|     19|     48|
|       doctor|     28|     64|
|     educator|     23|     63|
|     engineer|     22|     70|
|entertainment|     15|     50|
|    executive|     22|     69|
|   healthcare|     22|     62|
|    homemaker|     20|     50|
|       lawyer|     21|     53|
|    librarian|     23|     69|
|    marketing|     24|     55|
|         none|     11|     55|
|        other|     13|     64|
|   programmer|     20|     63|
|      retired|     51|     73|
|     salesman|     18|     66|
|    scientist|     23|     55|
|      student|      7|     42|
|   technician|     21|     55|
+-------------+-------+-------+
only showing top 20 rows
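
One thing to watch when computing minima and maxima: a CSV column read without a schema is a string column, and strings compare character by character, so `"10"` sorts before `"7"`. The sketch below shows the difference in plain Python with a few made-up age values (the `ages_as_text` list is hypothetical). Converting to a number first gives the numeric answer.

```python
# Hypothetical ages as they would arrive from an untyped CSV read (strings).
ages_as_text = ["10", "7", "42", "19"]

# String comparison is lexicographic: "1..." sorts before "4...", which sorts before "7".
print(min(ages_as_text), max(ages_as_text))   # 10 7

# Converting to int first restores numeric ordering.
ages = [int(a) for a in ages_as_text]
print(min(ages), max(ages))                   # 7 42
```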



### Step 7. For each combination of occupation and gender, calculate the mean age

In [19]:
users.groupBy(users.occupation, users.gender).agg(avg(users.age)).show()

+-------------+------+------------------+
|   occupation|gender|          avg(age)|
+-------------+------+------------------+
|   technician|     M| 32.96153846153846|
|     educator|     F| 39.11538461538461|
|       lawyer|     F|              39.5|
|entertainment|     F|              31.0|
|       lawyer|     M|              36.2|
|      retired|     F|              70.0|
|      student|     F|             20.75|
|   healthcare|     F| 39.81818181818182|
|administrator|     M| 37.16279069767442|
|    marketing|     M|            37.875|
|     engineer|     F|              29.5|
|    homemaker|     F|34.166666666666664|
|       artist|     F|30.307692307692307|
|         none|     F|              36.5|
|       doctor|     M| 43.57142857142857|
|       writer|     F| 37.63157894736842|
|     educator|     M| 43.10144927536232|
|    scientist|     M| 36.32142857142857|
|   technician|     F|              38.0|
|       writer|     M| 35.34615384615385|
+-------------+------+------------------+

### Step 8. For each occupation, present the percentage of women and men

In [32]:
(
    users.groupBy(users.occupation)
    .agg(
        # Count each gender by summing a 0/1 indicator per row.
        sum(when(col("gender") == "M", 1).otherwise(0)).alias("male_count"),
        sum(when(col("gender") == "F", 1).otherwise(0)).alias("female_count"),
        count("*").alias("total_count"),
    )
    .withColumn("male_percent", col("male_count") / col("total_count") * 100)
    .withColumn("female_percent", col("female_count") / col("total_count") * 100)
    .orderBy(col("occupation"))
    .show()
)

+-------------+----------+------------+-----------+------------------+------------------+
|   occupation|male_count|female_count|total_count|      male_percent|    female_percent|
+-------------+----------+------------+-----------+------------------+------------------+
|administrator|        43|          36|         79| 54.43037974683544| 45.56962025316456|
|       artist|        15|          13|         28| 53.57142857142857| 46.42857142857143|
|       doctor|         7|           0|          7|             100.0|               0.0|
|     educator|        69|          26|         95| 72.63157894736842|27.368421052631582|
|     engineer|        65|           2|         67| 97.01492537313433|2.9850746268656714|
|entertainment|        16|           2|         18| 88.88888888888889| 11.11111111111111|
|    executive|        29|           3|         32|            90.625|             9.375|
|   healthcare|         5|          11|         16|             31.25|             68.75|
|    homem