## <center style="color:Blue">World Happiness Dataset with Spark</center>

<center><img src="https://steemitimages.com/1280x0/https://www.psychologies.co.uk/sites/default/files/styles/psy2_page_header/public/wp-content/uploads/2012/03/happy.jpg"></center>

# <center>Importing Spark and data</center>


In [2]:
!pip install pyspark #install pySpark

Collecting pyspark
  Downloading pyspark-3.1.2.tar.gz (212.4 MB)
Collecting py4j==0.10.9
  Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.1.2-py2.py3-none-any.whl size=212880768 sha256=ce37668b0f7e3ea135918f0523b6da581b36a260fb8b2e12c2e017b5644ea959
  Stored in directory: /root/.cache/pip/wheels/a5/0a/c1/9561f6fecb759579a7d863dcd846daaa95f598744e71b02c77
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9 pyspark-3.1.2


In [3]:
import pyspark
from pyspark.sql import SparkSession

In [4]:
spark= SparkSession.builder.getOrCreate()

In [5]:
path= "DataSet/world-happiness-report.csv"
data= spark.read.csv(path)

In [6]:
data.show(10)

+------------+----+-----------+------------------+--------------+--------------------+--------------------+----------+--------------------+---------------+---------------+
|         _c0| _c1|        _c2|               _c3|           _c4|                 _c5|                 _c6|       _c7|                 _c8|            _c9|           _c10|
+------------+----+-----------+------------------+--------------+--------------------+--------------------+----------+--------------------+---------------+---------------+
|Country name|year|Life Ladder|Log GDP per capita|Social support|Healthy life expe...|Freedom to make l...|Generosity|Perceptions of co...|Positive affect|Negative affect|
| Afghanistan|2008|      3.724|             7.370|         0.451|              50.800|               0.718|     0.168|               0.882|          0.518|          0.258|
| Afghanistan|2009|      4.402|             7.540|         0.552|              51.200|               0.679|     0.190|               0.850| 
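
The first row of the output above is the CSV header, which Spark has read as data because `spark.read.csv` does not treat the first line as a header unless asked to. A minimal sketch of an alternative, assuming the file's first line really holds the column names: pass `header=True` and normalise the names. The helpers `clean_column_name` and `read_with_header` are hypothetical and not part of the notebook.

```python
import re

def clean_column_name(name):
    """Turn a CSV header such as 'Life Ladder' into 'Life_Ladder'."""
    return re.sub(r"\s+", "_", name.strip())

def read_with_header(spark, path):
    """Read a CSV whose first line holds the column names (assumed here)."""
    # header=True makes Spark use the first line as column names, not as data
    df = spark.read.csv(path, header=True)
    # toDF replaces all column names at once, in their existing order
    return df.toDF(*[clean_column_name(c) for c in df.columns])

# Usage with the notebook's session and path:
# data = read_with_header(spark, path)
```

Read this way, no header row ends up among the data rows, so there is no all-null row to remove later.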

## <center>Let's make it look better</center>

In [7]:
data.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: string (nullable = true)
 |-- _c10: string (nullable = true)



## 1.1 Create a schema

In [8]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType

In [9]:
new_schema= StructType([StructField("Country", StringType(), True),
                        StructField("Year", StringType(), True),
                        StructField("Life_Ladder", StringType(), True),
                        StructField("Log_Gdp", StringType(), True),
                        StructField("Social_support", StringType(), True),
                        StructField("Healthy_life_expectancy", StringType(), True),
                        StructField("Freedom_to_make_life_choices ", StringType(), True),
                        StructField("Generosity", StringType(), True),
                        StructField("Perceptions_of_corruption", StringType(), True),
                        StructField("Positive_affect", StringType(), True),
                        StructField("Negative_affect", StringType(), True)
                       ])

In [10]:
data= spark.read.csv(path, schema=new_schema)
data.show(3)

+------------+----+-----------+------------------+--------------+-----------------------+-----------------------------+----------+-------------------------+---------------+---------------+
|     Country|Year|Life_Ladder|           Log_Gdp|Social_support|Healthy_life_expectancy|Freedom_to_make_life_choices |Generosity|Perceptions_of_corruption|Positive_affect|Negative_affect|
+------------+----+-----------+------------------+--------------+-----------------------+-----------------------------+----------+-------------------------+---------------+---------------+
|Country name|year|Life Ladder|Log GDP per capita|Social support|   Healthy life expe...|         Freedom to make l...|Generosity|     Perceptions of co...|Positive affect|Negative affect|
| Afghanistan|2008|      3.724|             7.370|         0.451|                 50.800|                        0.718|     0.168|                    0.882|          0.518|          0.258|
| Afghanistan|2009|      4.402|             7.540|     

## 1.2 Changing datatype

In [11]:
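# Cast every column except Country to float; the new column gets a "_" suffix and the original is dropped
for i in data.columns:
    if i != "Country":
        data= data.withColumn(i+"_",data[i].cast(FloatType())).drop(i)
# Year should be an integer, so cast it once more and restore its original name
data= data.withColumn('Year',data['Year_'].cast(IntegerType())).drop("Year_")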

In [12]:
data.printSchema()

root
 |-- Country: string (nullable = true)
 |-- Life_Ladder_: float (nullable = true)
 |-- Log_Gdp_: float (nullable = true)
 |-- Social_support_: float (nullable = true)
 |-- Healthy_life_expectancy_: float (nullable = true)
 |-- Freedom_to_make_life_choices _: float (nullable = true)
 |-- Generosity_: float (nullable = true)
 |-- Perceptions_of_corruption_: float (nullable = true)
 |-- Positive_affect_: float (nullable = true)
 |-- Negative_affect_: float (nullable = true)
 |-- Year: integer (nullable = true)
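
The loop leaves every numeric column with a trailing underscore and moves `Year` to the end of the schema. A sketch of a variant that casts in place, keeping names and column order, by passing SQL `CAST` expressions to `selectExpr`. The helper `cast_exprs` is hypothetical; the choice of `INT` for `Year` and `FLOAT` for the other numeric columns mirrors the loop.

```python
def cast_exprs(columns, string_columns=("Country",), int_columns=("Year",)):
    """Build one SQL expression per column, preserving name and position."""
    exprs = []
    for name in columns:
        if name in string_columns:
            exprs.append(f"`{name}`")  # left as a string column
        elif name in int_columns:
            exprs.append(f"CAST(`{name}` AS INT) AS `{name}`")
        else:
            exprs.append(f"CAST(`{name}` AS FLOAT) AS `{name}`")
    return exprs

# Usage in place of the loop, on the string-typed DataFrame:
# data = data.selectExpr(*cast_exprs(data.columns))
```

The backticks let the expressions handle column names that contain spaces.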



In [13]:
data.show(5)

+------------+------------+--------+---------------+------------------------+------------------------------+-----------+--------------------------+----------------+----------------+----+
|     Country|Life_Ladder_|Log_Gdp_|Social_support_|Healthy_life_expectancy_|Freedom_to_make_life_choices _|Generosity_|Perceptions_of_corruption_|Positive_affect_|Negative_affect_|Year|
+------------+------------+--------+---------------+------------------------+------------------------------+-----------+--------------------------+----------------+----------------+----+
|Country name|        null|    null|           null|                    null|                          null|       null|                      null|            null|            null|null|
| Afghanistan|       3.724|    7.37|          0.451|                    50.8|                         0.718|      0.168|                     0.882|           0.518|           0.258|2008|
| Afghanistan|       4.402|    7.54|          0.552|             


## 1.3 Drop null values

In [15]:
print(data.count())
data=data.na.drop()
data.show(5)
print(data.count())

1950
+-----------+------------+--------+---------------+------------------------+------------------------------+-----------+--------------------------+----------------+----------------+----+
|    Country|Life_Ladder_|Log_Gdp_|Social_support_|Healthy_life_expectancy_|Freedom_to_make_life_choices _|Generosity_|Perceptions_of_corruption_|Positive_affect_|Negative_affect_|Year|
+-----------+------------+--------+---------------+------------------------+------------------------------+-----------+--------------------------+----------------+----------------+----+
|Afghanistan|       3.724|    7.37|          0.451|                    50.8|                         0.718|      0.168|                     0.882|           0.518|           0.258|2008|
|Afghanistan|       4.402|    7.54|          0.552|                    51.2|                         0.679|       0.19|                      0.85|           0.584|           0.237|2009|
|Afghanistan|       4.758|   7.647|          0.539|              
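
`na.drop()` removes every row that has a null in any column, so besides the header row it may also discard rows with genuinely missing measurements. Counting the nulls per column before dropping shows where they come from. This is a sketch; `null_count_exprs` is a hypothetical helper.

```python
def null_count_exprs(columns):
    """One SQL aggregate per column that counts its null entries."""
    return [
        f"SUM(CASE WHEN `{c}` IS NULL THEN 1 ELSE 0 END) AS `{c}`"
        for c in columns
    ]

# Usage on the notebook's DataFrame, before calling na.drop():
# data.selectExpr(*null_count_exprs(data.columns)).show()
```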

# <center> Data Exploration </center>

## Question1: How many countries are there?

In [16]:
country_data = data.groupBy('Country').count()  # keep the DataFrame; show() itself returns None
country_data.show()

+-----------+-----+
|    Country|count|
+-----------+-----+
|       Chad|   14|
|   Paraguay|   13|
|     Russia|   15|
|      Yemen|    7|
|    Senegal|   14|
|     Sweden|   14|
|     Guyana|    1|
|Philippines|   14|
|   Djibouti|    3|
|   Malaysia|   12|
|  Singapore|   12|
|     Turkey|   14|
|     Malawi|   12|
|       Iraq|   11|
|    Germany|   13|
|    Comoros|    6|
|Afghanistan|   12|
|   Cambodia|   12|
|Ivory Coast|    9|
|     Jordan|    2|
+-----------+-----+
only showing top 20 rows
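
The grouped table gives the number of rows per country, and `show()` prints only the first 20 groups. To answer the question directly, one option is to count the distinct values of `Country`. `count_countries` is a hypothetical helper.

```python
def count_countries(df, column="Country"):
    """Return the number of distinct values in `column`."""
    return df.select(column).distinct().count()

# Usage on the notebook's DataFrame:
# print("Number of countries =", count_countries(data))
```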



## Question2: How many rows were collected from Afghanistan?

In [17]:
Af_data= data.filter(data['Country']=="Afghanistan")
Af_data= Af_data.drop('Country')
print("Number of rows collected from Afghanistan =", Af_data.count())
Af_data.show(12)

Number of rows collected from Afghanistan = 12
+------------+--------+---------------+------------------------+------------------------------+-----------+--------------------------+----------------+----------------+----+
|Life_Ladder_|Log_Gdp_|Social_support_|Healthy_life_expectancy_|Freedom_to_make_life_choices _|Generosity_|Perceptions_of_corruption_|Positive_affect_|Negative_affect_|Year|
+------------+--------+---------------+------------------------+------------------------------+-----------+--------------------------+----------------+----------------+----+
|       3.724|    7.37|          0.451|                    50.8|                         0.718|      0.168|                     0.882|           0.518|           0.258|2008|
|       4.402|    7.54|          0.552|                    51.2|                         0.679|       0.19|                      0.85|           0.584|           0.237|2009|
|       4.758|   7.647|          0.539|                    51.6|              

## Question3: What is the average of Positive and Negative affect in Afghanistan?

In [18]:
Af_data.groupBy().avg('Positive_affect_').show()
Af_data.groupBy().avg('Negative_affect_').show()

+---------------------+
|avg(Positive_affect_)|
+---------------------+
|   0.5486666634678841|
+---------------------+

+---------------------+
|avg(Negative_affect_)|
+---------------------+
|   0.3264999948441982|
+---------------------+
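
Both averages can also be computed in a single aggregation by passing `agg` a dictionary that maps column names to aggregate function names. A sketch, with the hypothetical helper `mean_affect`:

```python
def mean_affect(df):
    """Average both affect columns in a single aggregation."""
    return df.agg({"Positive_affect_": "avg", "Negative_affect_": "avg"})

# Usage on the Afghanistan subset:
# mean_affect(Af_data).show()
```

The result is a one-row DataFrame with the columns `avg(Positive_affect_)` and `avg(Negative_affect_)`.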



# <center>Adding the 2021 dataset to the main data</center>

In [47]:
path_2021= "DataSet/world-happiness-report-2021.csv"
data_2021= spark.read.csv(path_2021, schema=new_schema)

In [48]:
data_2021.show(3)

+------------+------------------+------------+--------------------+--------------+-----------------------+-----------------------------+--------------+-------------------------+--------------------+---------------+
|     Country|              Year| Life_Ladder|             Log_Gdp|Social_support|Healthy_life_expectancy|Freedom_to_make_life_choices |    Generosity|Perceptions_of_corruption|     Positive_affect|Negative_affect|
+------------+------------------+------------+--------------------+--------------+-----------------------+-----------------------------+--------------+-------------------------+--------------------+---------------+
|Country name|Regional indicator|Ladder score|Standard error of...|  upperwhisker|           lowerwhisker|         Logged GDP per ca...|Social support|     Healthy life expe...|Freedom to make l...|     Generosity|
|     Finland|    Western Europe|       7.842|               0.032|         7.904|                  7.780|                       10.775|    

In [49]:
for i in data_2021.columns:
    if i != "Country":
        data_2021= data_2021.withColumn(i+"_",data_2021[i].cast(FloatType())).drop(i)
data_2021= data_2021.withColumn('Year',data_2021['Year_'].cast(IntegerType())).drop("Year_")


In [50]:
data_2021.show(10)

+------------+------------+--------+---------------+------------------------+------------------------------+-----------+--------------------------+----------------+----------------+----+
|     Country|Life_Ladder_|Log_Gdp_|Social_support_|Healthy_life_expectancy_|Freedom_to_make_life_choices _|Generosity_|Perceptions_of_corruption_|Positive_affect_|Negative_affect_|Year|
+------------+------------+--------+---------------+------------------------+------------------------------+-----------+--------------------------+----------------+----------------+----+
|Country name|        null|    null|           null|                    null|                          null|       null|                      null|            null|            null|null|
|     Finland|       7.842|   0.032|          7.904|                    7.78|                        10.775|      0.954|                      72.0|           0.949|          -0.098|null|
|     Denmark|        7.62|   0.035|          7.687|             
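
The outputs above show that the 2021 file has a different header (`Regional indicator`, `Ladder score`, `upperwhisker`, `lowerwhisker`, ...), so reading it with `new_schema` assigns values by position: `Log_Gdp_` receives 0.032 from the column whose header starts with `Standard error of`, and `Year` is null. A sketch of aligning by name instead, assuming the file is read with `header=True` and that every row of this file belongs to the year 2021. `RENAME_2021`, `select_exprs_2021` and `align_2021` are hypothetical; the mapping lists only headers that are fully visible in the output, so the truncated ones would have to be added after checking the real header.

```python
# Assumed mapping from 2021 header names to the column names used in `data`.
RENAME_2021 = {
    "Country name": "Country",
    "Ladder score": "Life_Ladder_",
    "Social support": "Social_support_",
    "Generosity": "Generosity_",
}

def select_exprs_2021(rename_map, year=2021):
    """Build SQL expressions that rename, cast, and add a constant Year."""
    exprs = []
    for source, target in rename_map.items():
        sql_type = "STRING" if target == "Country" else "FLOAT"
        exprs.append(f"CAST(`{source}` AS {sql_type}) AS `{target}`")
    # The 2021 file has no year column, so a constant is added (assumption)
    exprs.append(f"CAST({int(year)} AS INT) AS `Year`")
    return exprs

def align_2021(spark, path, rename_map=RENAME_2021):
    """Read the 2021 CSV by header name and reshape it to match `data`."""
    raw = spark.read.csv(path, header=True)
    return raw.selectExpr(*select_exprs_2021(rename_map))

# Usage with the notebook's objects (Spark 3.1+ for allowMissingColumns):
# all_data = data.unionByName(align_2021(spark, path_2021), allowMissingColumns=True)
```

`unionByName` matches columns by name rather than position, and `allowMissingColumns=True` fills the columns that are absent from the 2021 frame with nulls.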

In [51]:
all_data= data.union(data_2021)

In [52]:
all_data.show(5)

+-----------+------------+--------+---------------+------------------------+------------------------------+-----------+--------------------------+----------------+----------------+----+
|    Country|Life_Ladder_|Log_Gdp_|Social_support_|Healthy_life_expectancy_|Freedom_to_make_life_choices _|Generosity_|Perceptions_of_corruption_|Positive_affect_|Negative_affect_|Year|
+-----------+------------+--------+---------------+------------------------+------------------------------+-----------+--------------------------+----------------+----------------+----+
|Afghanistan|       3.724|    7.37|          0.451|                    50.8|                         0.718|      0.168|                     0.882|           0.518|           0.258|2008|
|Afghanistan|       4.402|    7.54|          0.552|                    51.2|                         0.679|       0.19|                      0.85|           0.584|           0.237|2009|
|Afghanistan|       4.758|   7.647|          0.539|                   