<a href="https://colab.research.google.com/github/marymlucas/obesity_lifestyle_diet/blob/main/project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project: Predict obesity of individuals based on diet and lifestyle habits

In [1]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null


In [2]:
!wget https://dlcdn.apache.org/spark/spark-3.0.3/spark-3.0.3-bin-hadoop2.7.tgz

--2022-03-02 23:50:29--  https://dlcdn.apache.org/spark/spark-3.0.3/spark-3.0.3-bin-hadoop2.7.tgz
Resolving dlcdn.apache.org (dlcdn.apache.org)... 151.101.2.132, 2a04:4e42::644
Connecting to dlcdn.apache.org (dlcdn.apache.org)|151.101.2.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 220400553 (210M) [application/x-gzip]
Saving to: ‘spark-3.0.3-bin-hadoop2.7.tgz’


2022-03-02 23:50:30 (252 MB/s) - ‘spark-3.0.3-bin-hadoop2.7.tgz’ saved [220400553/220400553]



In [3]:
!tar xf spark-3.0.3-bin-hadoop2.7.tgz

In [4]:
!pip install -q findspark

In [7]:
!pip install pyspark==3.0.3

Collecting pyspark==3.0.3
  Downloading pyspark-3.0.3.tar.gz (209.1 MB)
Collecting py4j==0.10.9
  Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.0.3-py2.py3-none-any.whl size=209435971 sha256=27edb0e055f397fabe2fc5fbfcb492a471e8e4b22cd058d8e81238054e607c94
  Stored in directory: /root/.cache/pip/wheels/7e/6d/0a/6b0bf301bc056d9af03194b732b9f49ad2fceb205aab2984fd
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9 pyspark-3.0.3


In [5]:
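import os

# Point PySpark at the JDK installed above and at the unpacked Spark distribution.
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
# Use an absolute path so the setting survives a later change of working directory.
os.environ["SPARK_HOME"] = os.path.abspath("spark-3.0.3-bin-hadoop2.7")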
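Before a Spark session is started, it can help to confirm that both variables point to directories that actually exist, since a wrong `JAVA_HOME` or `SPARK_HOME` tends to surface only later as an opaque startup error. The following is a minimal sketch, not part of the original notebook; the helper name `missing_spark_paths` is hypothetical.

```python
import os


def missing_spark_paths(env=None, names=("JAVA_HOME", "SPARK_HOME")):
    """Return the variables in `names` that are unset or not an existing directory.

    Hypothetical helper for illustration only; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    missing = []
    for name in names:
        path = env.get(name)
        if not path or not os.path.isdir(path):
            missing.append(name)
    return missing


problems = missing_spark_paths()
if problems:
    print("Not set or not a directory:", ", ".join(problems))
else:
    print("JAVA_HOME and SPARK_HOME both point to existing directories.")
```

An empty list means both locations were found; any name in the list is worth fixing before calling `SparkSession.builder`.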

In [9]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [10]:
from pyspark.sql import SparkSession

In [11]:
APP_NAME = "Final Project"

In [12]:
spark = SparkSession.builder.appName(APP_NAME).getOrCreate()

In [13]:
spark

In [14]:
import pandas as pd
import numpy as np
from scipy.io.arff import loadarff 


# Data Import and Exploration

In [15]:
raw_data = (
    spark.read.format('csv')
    .option('header', 'true')
    .option('inferSchema', 'true')
    .load('/content/drive/MyDrive/Colab Notebooks/DSCI-632/project/data/ObesityDataSet_raw_and_data_sinthetic.csv')
)
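The same Drive folder is used again further down for the ARFF copy of the dataset. As a sketch (the names `DATA_DIR`, `DATASET_STEM`, `CSV_PATH`, and `ARFF_PATH` are illustrative and not from the original notebook), the folder can be defined once and both file paths derived from it:

```python
from pathlib import PurePosixPath

# Illustrative names; the folder and file names are the ones used in this notebook.
DATA_DIR = PurePosixPath("/content/drive/MyDrive/Colab Notebooks/DSCI-632/project/data")
DATASET_STEM = "ObesityDataSet_raw_and_data_sinthetic"

CSV_PATH = DATA_DIR / f"{DATASET_STEM}.csv"
ARFF_PATH = DATA_DIR / f"{DATASET_STEM}.arff"

# Convert to plain strings before handing the paths to other APIs.
print(str(CSV_PATH))
print(str(ARFF_PATH))
```

`str(CSV_PATH)` can then be passed to `.load(...)` and `str(ARFF_PATH)` to `open(...)`, so a change of folder only needs to be made in one place.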

In [16]:
raw_data.show(5)

+------+----+------+------+------------------------------+----+----+---+---------+-----+----+---+---+---+----------+--------------------+-------------------+
|Gender| Age|Height|Weight|family_history_with_overweight|FAVC|FCVC|NCP|     CAEC|SMOKE|CH2O|SCC|FAF|TUE|      CALC|              MTRANS|         NObeyesdad|
+------+----+------+------+------------------------------+----+----+---+---------+-----+----+---+---+---+----------+--------------------+-------------------+
|Female|21.0|  1.62|  64.0|                           yes|  no| 2.0|3.0|Sometimes|   no| 2.0| no|0.0|1.0|        no|Public_Transporta...|      Normal_Weight|
|Female|21.0|  1.52|  56.0|                           yes|  no| 3.0|3.0|Sometimes|  yes| 3.0|yes|3.0|0.0| Sometimes|Public_Transporta...|      Normal_Weight|
|  Male|23.0|   1.8|  77.0|                           yes|  no| 2.0|3.0|Sometimes|   no| 2.0| no|2.0|1.0|Frequently|Public_Transporta...|      Normal_Weight|
|  Male|27.0|   1.8|  87.0|                         

In [17]:
# Attribute names and types are declared in the header section of the ARFF file
!pip install liac-arff

Collecting liac-arff
  Downloading liac-arff-2.5.0.tar.gz (13 kB)
Building wheels for collected packages: liac-arff
  Building wheel for liac-arff (setup.py) ... done
  Created wheel for liac-arff: filename=liac_arff-2.5.0-py3-none-any.whl size=11732 sha256=7c7fd7c1e281dc138b92536d163417ecc01e6d3f83682479898e4cdd2b5d3aca
  Stored in directory: /root/.cache/pip/wheels/1f/0f/15/332ca86cbebf25ddf98518caaf887945fbe1712b97a0f2493b
Successfully built liac-arff
Installing collected packages: liac-arff
Successfully installed liac-arff-2.5.0


In [18]:
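import arff

# Parse the ARFF file with liac-arff. The result is a dict whose 'attributes'
# entry lists (name, type) pairs; in this file the type is either 'NUMERIC'
# or a list of the allowed category labels.
with open('/content/drive/MyDrive/Colab Notebooks/DSCI-632/project/data/ObesityDataSet_raw_and_data_sinthetic.arff') as handle:
    data = arff.load(handle)

for attribute in data['attributes']:
    print(attribute)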

('Gender', ['Female', 'Male'])
('Age', 'NUMERIC')
('Height', 'NUMERIC')
('Weight', 'NUMERIC')
('family_history_with_overweight', ['yes', 'no'])
('FAVC', ['yes', 'no'])
('FCVC', 'NUMERIC')
('NCP', 'NUMERIC')
('CAEC', ['no', 'Sometimes', 'Frequently', 'Always'])
('SMOKE', ['yes', 'no'])
('CH2O', 'NUMERIC')
('SCC', ['yes', 'no'])
('FAF', 'NUMERIC')
('TUE', 'NUMERIC')
('CALC', ['no', 'Sometimes', 'Frequently', 'Always'])
('MTRANS', ['Automobile', 'Motorbike', 'Bike', 'Public_Transportation', 'Walking'])
('NObeyesdad', ['Insufficient_Weight', 'Normal_Weight', 'Overweight_Level_I', 'Overweight_Level_II', 'Obesity_Type_I', 'Obesity_Type_II', 'Obesity_Type_III'])
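The printed pairs already separate numeric from nominal attributes, which is what later preprocessing usually needs to know. Below is a minimal sketch, assuming the `(name, type)` pairs shown above (copied by hand so the block runs without the ARFF file). `split_attributes` and `to_ddl` are hypothetical helper names, and mapping nominal attributes to `STRING` and numeric ones to `DOUBLE` is an assumption rather than something the notebook specifies.

```python
# (name, type) pairs as printed above; a list of labels marks a nominal attribute.
ATTRIBUTES = [
    ("Gender", ["Female", "Male"]),
    ("Age", "NUMERIC"),
    ("Height", "NUMERIC"),
    ("Weight", "NUMERIC"),
    ("family_history_with_overweight", ["yes", "no"]),
    ("FAVC", ["yes", "no"]),
    ("FCVC", "NUMERIC"),
    ("NCP", "NUMERIC"),
    ("CAEC", ["no", "Sometimes", "Frequently", "Always"]),
    ("SMOKE", ["yes", "no"]),
    ("CH2O", "NUMERIC"),
    ("SCC", ["yes", "no"]),
    ("FAF", "NUMERIC"),
    ("TUE", "NUMERIC"),
    ("CALC", ["no", "Sometimes", "Frequently", "Always"]),
    ("MTRANS", ["Automobile", "Motorbike", "Bike", "Public_Transportation", "Walking"]),
    ("NObeyesdad", ["Insufficient_Weight", "Normal_Weight", "Overweight_Level_I",
                    "Overweight_Level_II", "Obesity_Type_I", "Obesity_Type_II",
                    "Obesity_Type_III"]),
]


def split_attributes(attributes):
    """Return (numeric_names, categorical_names) from (name, type) pairs."""
    numeric, categorical = [], []
    for name, attr_type in attributes:
        if isinstance(attr_type, list):
            categorical.append(name)
        else:
            numeric.append(name)
    return numeric, categorical


def to_ddl(attributes):
    """Build a DDL-style schema string such as 'Gender STRING, Age DOUBLE'."""
    parts = []
    for name, attr_type in attributes:
        # Assumption: nominal -> STRING, everything else -> DOUBLE.
        spark_type = "STRING" if isinstance(attr_type, list) else "DOUBLE"
        parts.append(f"{name} {spark_type}")
    return ", ".join(parts)


numeric_cols, categorical_cols = split_attributes(ATTRIBUTES)
print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
print(to_ddl(ATTRIBUTES))
```

A string in this form could be supplied to `spark.read.schema(...)` in place of the `inferSchema` option, which would avoid the extra pass over the file that inference needs. Note that `NObeyesdad` is the target label, so it would normally be excluded from the categorical feature list.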


## Exploratory Data Analysis

In [19]:
# describe() only returns a DataFrame of summary statistics; append .show() to print the values.
raw_data.describe()

DataFrame[summary: string, Gender: string, Age: string, Height: string, Weight: string, family_history_with_overweight: string, FAVC: string, FCVC: string, NCP: string, CAEC: string, SMOKE: string, CH2O: string, SCC: string, FAF: string, TUE: string, CALC: string, MTRANS: string, NObeyesdad: string]