# Data Exploration and Cleaning 

# Obesity Dataset

The Obesity dataset was chosen for this project because obesity is a universal and intersectional problem. It touches on medicine, genetics, psychology, sociology, urban planning, and even economics. Obesity is a chronic disease; it is recognized as the fifth leading cause of mortality worldwide and is a major public health concern. Age, weight, height, body mass index, and other hereditary and lifestyle variables are only a few of the contributing factors.

The dataset columns are described as follows. The attributes related to eating habits are: Frequent consumption of high caloric food (FAVC), Frequency of consumption of vegetables (FCVC), Number of main meals (NCP), Consumption of food between meals (CAEC), Consumption of water daily (CH2O), and Consumption of alcohol (CALC). The attributes related to physical condition are: Calories consumption monitoring (SCC), Physical activity frequency (FAF), Time using technology devices (TUE), and Transportation used (MTRANS).

In [1]:
#Import Libraries
import pandas as pd

In [2]:
# Load the data from CSV file
data = pd.read_csv("Obesity.csv")

In [3]:
data.head()

Unnamed: 0,Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,NObeyesdad
0,Female,21.0,1.62,64.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,0.0,1.0,no,Public_Transportation,Normal_Weight
1,Female,21.0,1.52,56.0,yes,no,3.0,3.0,Sometimes,yes,3.0,yes,3.0,0.0,Sometimes,Public_Transportation,Normal_Weight
2,Male,23.0,1.8,77.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,2.0,1.0,Frequently,Public_Transportation,Normal_Weight
3,Male,27.0,1.8,87.0,no,no,3.0,3.0,Sometimes,no,2.0,no,2.0,0.0,Frequently,Walking,Overweight_Level_I
4,Male,22.0,1.78,89.8,no,no,2.0,1.0,Sometimes,no,2.0,no,0.0,0.0,Sometimes,Public_Transportation,Overweight_Level_II


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2111 entries, 0 to 2110
Data columns (total 17 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Gender                          2111 non-null   object 
 1   Age                             2111 non-null   float64
 2   Height                          2111 non-null   float64
 3   Weight                          2111 non-null   float64
 4   family_history_with_overweight  2111 non-null   object 
 5   FAVC                            2111 non-null   object 
 6   FCVC                            2111 non-null   float64
 7   NCP                             2111 non-null   float64
 8   CAEC                            2111 non-null   object 
 9   SMOKE                           2111 non-null   object 
 10  CH2O                            2111 non-null   float64
 11  SCC                             2111 non-null   object 
 12  FAF                             21

In [5]:
data.isnull().sum()

Gender                            0
Age                               0
Height                            0
Weight                            0
family_history_with_overweight    0
FAVC                              0
FCVC                              0
NCP                               0
CAEC                              0
SMOKE                             0
CH2O                              0
SCC                               0
FAF                               0
TUE                               0
CALC                              0
MTRANS                            0
NObeyesdad                        0
dtype: int64

In [6]:
data['NObeyesdad'].value_counts()


Obesity_Type_I         351
Obesity_Type_III       324
Obesity_Type_II        297
Overweight_Level_I     290
Overweight_Level_II    290
Normal_Weight          287
Insufficient_Weight    272
Name: NObeyesdad, dtype: int64

In [7]:
data.shape

(2111, 17)

In [8]:
# Helper that computes BMI from height and weight
def calculate_bmi(height, weight):
    # BMI = weight (kg) / height (m) squared
    bmi = weight / (height ** 2)
    return round(bmi, 2)  # Round to 2 decimal places


# Add a new 'BMI' column to the existing DataFrame
data['BMI'] = data.apply(lambda row: calculate_bmi(row['Height'], row['Weight']), axis=1)

data.head()

Unnamed: 0,Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,NObeyesdad,BMI
0,Female,21.0,1.62,64.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,0.0,1.0,no,Public_Transportation,Normal_Weight,24.39
1,Female,21.0,1.52,56.0,yes,no,3.0,3.0,Sometimes,yes,3.0,yes,3.0,0.0,Sometimes,Public_Transportation,Normal_Weight,24.24
2,Male,23.0,1.8,77.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,2.0,1.0,Frequently,Public_Transportation,Normal_Weight,23.77
3,Male,27.0,1.8,87.0,no,no,3.0,3.0,Sometimes,no,2.0,no,2.0,0.0,Frequently,Walking,Overweight_Level_I,26.85
4,Male,22.0,1.78,89.8,no,no,2.0,1.0,Sometimes,no,2.0,no,0.0,0.0,Sometimes,Public_Transportation,Overweight_Level_II,28.34
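The row-wise `apply` above can also be written as a single column-wise expression. The following is a minimal sketch, assuming a small hand-made frame (`sample`, a hypothetical name) that copies the first five Height/Weight pairs shown above; it is an illustration, not the notebook's own method.

```python
import pandas as pd

# Hypothetical sample copying the first five Height/Weight pairs shown above
sample = pd.DataFrame({
    "Height": [1.62, 1.52, 1.80, 1.80, 1.78],
    "Weight": [64.0, 56.0, 77.0, 87.0, 89.8],
})

# BMI = weight (kg) / height (m) squared, computed on whole columns at once
sample["BMI"] = (sample["Weight"] / sample["Height"] ** 2).round(2)

print(sample)
```

On a frame of this size the two approaches should give the same values; the column-wise form mainly avoids calling a Python function once per row.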


***Check whether the obesity classes in the NObeyesdad column are labelled correctly***

In [9]:
# Group by the 'NObeyesdad' column and get the maximum BMI for each category
max_bmi_by_category = data.groupby('NObeyesdad')['BMI'].max()
print('Maximum values for each category')
print(max_bmi_by_category)

# Group by the 'NObeyesdad' column and get the minimum BMI for each category
min_bmi_by_category = data.groupby('NObeyesdad')['BMI'].min()
print('Minimum values for each category')
print(min_bmi_by_category)
print('-----------------------------------------------------------------------------------------------------')
print('The results are compared with WHO Obesity Classification')
print('WHO Classification')
print('Underweight BMI < 18.5\nNormal weight BMI = 18.5 – 24.9\nOverweight BMI = 25.0 – 29.9\nObesity\na. Class I: BMI = 30.0 – 34.9\nb. Class II: BMI = 35.0 – 39.9\nc. Class III: BMI ≥ 40.0')

Maximum values for each category
NObeyesdad
Insufficient_Weight    19.08
Normal_Weight          24.91
Obesity_Type_I         35.17
Obesity_Type_II        39.79
Obesity_Type_III       50.81
Overweight_Level_I     28.77
Overweight_Level_II    30.36
Name: BMI, dtype: float64
Minimum values for each category
NObeyesdad
Insufficient_Weight    13.00
Normal_Weight          18.49
Obesity_Type_I         29.91
Obesity_Type_II        34.05
Obesity_Type_III       36.77
Overweight_Level_I     22.83
Overweight_Level_II    25.71
Name: BMI, dtype: float64
-----------------------------------------------------------------------------------------------------
The results are compared with WHO Obesity Classification
WHO Classification
Underweight BMI < 18.5
Normal weight BMI = 18.5 – 24.9
Overweight BMI = 25.0 – 29.9
Obesity
a. Class I: BMI = 30.0 – 34.9
b. Class II: BMI = 35.0 – 39.9
c. Class III: BMI ≥ 40.0
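The minimum and maximum per category can also be read side by side and compared against a WHO lower bound. This is a minimal sketch on hypothetical toy data (`toy`, `who_lower` and the column names `who_lower` / `starts_below_who` are illustrative assumptions); only two categories are included to keep it short.

```python
import pandas as pd

# Hypothetical toy data; labels and BMI values mimic the extremes printed above
toy = pd.DataFrame({
    "NObeyesdad": ["Normal_Weight", "Normal_Weight", "Obesity_Type_I", "Obesity_Type_I"],
    "BMI": [18.49, 24.91, 29.91, 35.17],
})

# One groupby call returns both extremes side by side
bmi_range = toy.groupby("NObeyesdad")["BMI"].agg(["min", "max"])

# WHO lower bounds for the two categories in this toy example
who_lower = {"Normal_Weight": 18.5, "Obesity_Type_I": 30.0}
bmi_range["who_lower"] = [who_lower[label] for label in bmi_range.index]

# Flag categories whose smallest BMI falls below the WHO lower bound
bmi_range["starts_below_who"] = bmi_range["min"] < bmi_range["who_lower"]

print(bmi_range)
```

A flagged category indicates that at least one row carries a label that the WHO ranges would not assign to its BMI.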


In [10]:
# Get the counts of all unique values in the 'NObeyesdad' column
counts = data['NObeyesdad'].value_counts()

# Print the counts for "Overweight_Level_I" and "Overweight_Level_II"
overweight_level_I_count = counts.get('Overweight_Level_I', 0)
overweight_level_II_count = counts.get('Overweight_Level_II', 0)

print(f"There are {overweight_level_I_count} occurrences of 'Overweight_Level_I' in the 'NObeyesdad' column.")
print(f"There are {overweight_level_II_count} occurrences of 'Overweight_Level_II' in the 'NObeyesdad' column.")
# Sum the counts
total_overweight_counts = overweight_level_I_count + overweight_level_II_count

print(f"There are {total_overweight_counts} total occurrences of 'Overweight_Level_I' and 'Overweight_Level_II' in the 'NObeyesdad' column.")

There are 290 occurrences of 'Overweight_Level_I' in the 'NObeyesdad' column.
There are 290 occurrences of 'Overweight_Level_II' in the 'NObeyesdad' column.
There are 580 total occurrences of 'Overweight_Level_I' and 'Overweight_Level_II' in the 'NObeyesdad' column.


***Obesity classification needs to be corrected based on the WHO classification***

In [11]:
# The original labels contain errors, so the obesity levels are reassigned according to the WHO ranges
def classify_bmi(bmi):
    if bmi < 18.5:
        return 'Insufficient_Weight'
    elif 18.5 <= bmi < 25:
        return 'Normal_Weight'
    elif 25 <= bmi < 30:
        return 'Overweight'
    elif 30 <= bmi < 35:
        return 'Obesity_Type_I'
    elif 35 <= bmi < 40:
        return 'Obesity_Type_II'
    else:
        return 'Obesity_Type_III'

# Add the 'Classes' column to the DataFrame by applying the classify_bmi function to the 'BMI' column
data['Classes'] = data['BMI'].apply(classify_bmi)

# Display the first 5 rows to see the new column
data.head()

Unnamed: 0,Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,NObeyesdad,BMI,Classes
0,Female,21.0,1.62,64.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,0.0,1.0,no,Public_Transportation,Normal_Weight,24.39,Normal_Weight
1,Female,21.0,1.52,56.0,yes,no,3.0,3.0,Sometimes,yes,3.0,yes,3.0,0.0,Sometimes,Public_Transportation,Normal_Weight,24.24,Normal_Weight
2,Male,23.0,1.8,77.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,2.0,1.0,Frequently,Public_Transportation,Normal_Weight,23.77,Normal_Weight
3,Male,27.0,1.8,87.0,no,no,3.0,3.0,Sometimes,no,2.0,no,2.0,0.0,Frequently,Walking,Overweight_Level_I,26.85,Overweight
4,Male,22.0,1.78,89.8,no,no,2.0,1.0,Sometimes,no,2.0,no,0.0,0.0,Sometimes,Public_Transportation,Overweight_Level_II,28.34,Overweight
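The same WHO thresholds can be expressed as bin edges for `pd.cut`. The sketch below is an alternative formulation, not the notebook's method; the series `bmi` holds hypothetical boundary values chosen to show which side of each edge is included.

```python
import numpy as np
import pandas as pd

# Bin edges and labels taken from the classify_bmi thresholds above
bins = [-np.inf, 18.5, 25, 30, 35, 40, np.inf]
labels = ["Insufficient_Weight", "Normal_Weight", "Overweight",
          "Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III"]

# Hypothetical boundary values
bmi = pd.Series([18.49, 18.5, 24.99, 25.0, 29.99, 30.0, 35.0, 40.0])

# right=False makes each bin closed on the left: [18.5, 25), [25, 30), ...
classes = pd.cut(bmi, bins=bins, labels=labels, right=False)

print(classes.tolist())
```

One behavioural difference is worth keeping in mind: a missing BMI stays missing with `pd.cut`, whereas `classify_bmi` would fall through to its `else` branch and return `'Obesity_Type_III'`. The dataset has no missing values, so this does not affect the results here.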


In [12]:
# Compare the new 'Classes' column with the 'NObeyesdad' column to see how many labels differ.
# Create a new column called 'comparison': 0 when the two labels are identical, 1 when they differ.
data['comparison'] = (data['NObeyesdad'] != data['Classes']).astype(int)

# Display the first 5 rows to see the new column
data.head()


Unnamed: 0,Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,NObeyesdad,BMI,Classes,comparison
0,Female,21.0,1.62,64.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,0.0,1.0,no,Public_Transportation,Normal_Weight,24.39,Normal_Weight,0
1,Female,21.0,1.52,56.0,yes,no,3.0,3.0,Sometimes,yes,3.0,yes,3.0,0.0,Sometimes,Public_Transportation,Normal_Weight,24.24,Normal_Weight,0
2,Male,23.0,1.8,77.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,2.0,1.0,Frequently,Public_Transportation,Normal_Weight,23.77,Normal_Weight,0
3,Male,27.0,1.8,87.0,no,no,3.0,3.0,Sometimes,no,2.0,no,2.0,0.0,Frequently,Walking,Overweight_Level_I,26.85,Overweight,1
4,Male,22.0,1.78,89.8,no,no,2.0,1.0,Sometimes,no,2.0,no,0.0,0.0,Sometimes,Public_Transportation,Overweight_Level_II,28.34,Overweight,1


In [13]:
#count how many ones we have 
number_of_ones = data['comparison'].sum()

print(f"There are {number_of_ones} ones in the 'comparison' column.")


There are 665 ones in the 'comparison' column.


In [14]:
# Replace "Overweight_Level_I" and "Overweight_Level_II" with "Overweight" in the 'NObeyesdad' column
data = data.replace({'NObeyesdad': {'Overweight_Level_I': 'Overweight', 'Overweight_Level_II': 'Overweight'}})

# Display the first 5 rows to see the changes
data.head()


Unnamed: 0,Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,NObeyesdad,BMI,Classes,comparison
0,Female,21.0,1.62,64.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,0.0,1.0,no,Public_Transportation,Normal_Weight,24.39,Normal_Weight,0
1,Female,21.0,1.52,56.0,yes,no,3.0,3.0,Sometimes,yes,3.0,yes,3.0,0.0,Sometimes,Public_Transportation,Normal_Weight,24.24,Normal_Weight,0
2,Male,23.0,1.8,77.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,2.0,1.0,Frequently,Public_Transportation,Normal_Weight,23.77,Normal_Weight,0
3,Male,27.0,1.8,87.0,no,no,3.0,3.0,Sometimes,no,2.0,no,2.0,0.0,Frequently,Walking,Overweight,26.85,Overweight,1
4,Male,22.0,1.78,89.8,no,no,2.0,1.0,Sometimes,no,2.0,no,0.0,0.0,Sometimes,Public_Transportation,Overweight,28.34,Overweight,1


In [15]:
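# Count the ones, i.e. the number of mismatches.
# Note: 'comparison' was computed in In [12], before the two overweight levels were merged,
# so this count still includes every row originally labelled Overweight_Level_I or Overweight_Level_II.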
number_of_ones = data['comparison'].sum()

print(f"There are {number_of_ones} errors.")


There are 665 errors.
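The count is unchanged because the `comparison` column was built before `Overweight_Level_I` and `Overweight_Level_II` were merged into `Overweight`. A minimal sketch of recomputing the comparison after the merge, on hypothetical toy rows (`toy`, `before`, `merged` and `after` are illustrative names):

```python
import pandas as pd

# Hypothetical toy rows: two overweight labels and one genuine mismatch
toy = pd.DataFrame({
    "NObeyesdad": ["Normal_Weight", "Overweight_Level_I", "Overweight_Level_II", "Obesity_Type_III"],
    "Classes": ["Normal_Weight", "Overweight", "Overweight", "Obesity_Type_II"],
})

# Comparison before merging: both overweight rows count as mismatches
before = int((toy["NObeyesdad"] != toy["Classes"]).sum())

# Merge the two overweight levels, then recompute the comparison
merged = toy["NObeyesdad"].replace({
    "Overweight_Level_I": "Overweight",
    "Overweight_Level_II": "Overweight",
})
after = int((merged != toy["Classes"]).sum())

print(before, after)
```

In this toy example only the last row remains a mismatch once the labels are merged; on the real data the recomputed count would have to be obtained by running the comparison again.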


In [16]:
# Filter the rows where the 'comparison' column has the value 1, and select the desired columns
errors = data[data['comparison'] == 1][['comparison', 'BMI', 'NObeyesdad']]
print('This is a sample of errors')
# Last 5 rows of the errors DataFrame (not displayed, because this is not the last expression in the cell)
errors.tail()
max_bmi_by_category = data.groupby('NObeyesdad')['BMI'].max()
print('Maximum values for each category')
print(max_bmi_by_category)
print('According to WHO, BMI = 36.86 is Obesity Type II, but the data labels it Obesity Type III; BMI = 19.08 is Normal weight, but the data labels it Insufficient_Weight')


This is a sample of errors
Maximum values for each category
NObeyesdad
Insufficient_Weight    19.08
Normal_Weight          24.91
Obesity_Type_I         35.17
Obesity_Type_II        39.79
Obesity_Type_III       50.81
Overweight             30.36
Name: BMI, dtype: float64
According to WHO, BMI = 36.86 is Obesity Type II, but the data labels it Obesity Type III; BMI = 19.08 is Normal weight, but the data labels it Insufficient_Weight


In [17]:
data['Classes'].unique()

array(['Normal_Weight', 'Overweight', 'Obesity_Type_I',
       'Insufficient_Weight', 'Obesity_Type_II', 'Obesity_Type_III'],
      dtype=object)

In [18]:
data['Classes'].value_counts()

Overweight             566
Obesity_Type_I         368
Obesity_Type_II        338
Normal_Weight          301
Insufficient_Weight    270
Obesity_Type_III       268
Name: Classes, dtype: int64
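To see where the original labels and the WHO-based classes disagree, a cross-tabulation can help. This is a minimal sketch on hypothetical toy rows; `toy` and `confusion` are illustrative names, and on the real data it would need to run before the `NObeyesdad` column is dropped.

```python
import pandas as pd

# Hypothetical toy rows: original label vs. WHO-based class
toy = pd.DataFrame({
    "NObeyesdad": ["Insufficient_Weight", "Normal_Weight", "Obesity_Type_III", "Obesity_Type_III"],
    "Classes": ["Normal_Weight", "Normal_Weight", "Obesity_Type_II", "Obesity_Type_III"],
})

# Rows: original label, columns: recomputed class; off-diagonal cells are disagreements
confusion = pd.crosstab(toy["NObeyesdad"], toy["Classes"])

print(confusion)
```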

In [19]:
# Create the cleaned dataset
# Drop the 'NObeyesdad' and 'comparison' columns
data = data.drop(columns=['NObeyesdad', 'comparison'])

***Explore each column and check it for errors***

***1. Number of main meals (NCP). NCP is a categorical column; the survey offers 3 categories***


***How many main meals do you have daily?***

***Between 1 and 2***

***Three***

***More than three***

In [20]:
data['NCP'].dtypes

dtype('float64')

In [21]:
#Number of main meals NCP
data['NCP'].max()

4.0

In [22]:
data['NCP'].min()

1.0

In [23]:
# Check how many unique values the NCP column has; there should be 3 unique values
unique_categories = data['NCP'].unique()
unique_categories


array([3.      , 1.      , 4.      , 3.28926 , 3.995147, 1.72626 ,
       2.581015, 1.600812, 1.73762 , 1.10548 , 2.0846  , 1.894384,
       2.857787, 3.765526, 3.285167, 3.691226, 3.156153, 1.07976 ,
       3.559841, 3.891994, 3.240578, 3.904858, 3.11158 , 3.590039,
       2.057935, 3.558637, 2.000986, 3.821168, 3.897078, 3.092116,
       3.286431, 3.592415, 3.754599, 3.566082, 3.725797, 3.520555,
       3.731212, 1.259803, 1.273128, 3.304123, 3.647154, 3.300666,
       3.535016, 1.717608, 2.884479, 3.626815, 1.473088, 3.16645 ,
       3.494849, 2.99321 , 2.127797, 3.90779 , 3.699594, 3.179995,
       1.075553, 3.238258, 3.804944, 1.630846, 3.762778, 3.371832,
       2.705445, 3.34175 , 2.217651, 2.893778, 3.502604, 3.998766,
       3.193671, 1.69608 , 2.812377, 1.612747, 1.082304, 1.882158,
       2.326233, 1.989398, 1.735493, 2.974568, 3.715118, 3.489918,
       3.378859, 3.263201, 3.994588, 3.24934 , 3.087544, 1.163666,
       3.409363, 3.281391, 3.98525 , 3.207071, 3.471536, 3.488

***The NCP column needs to be corrected***

In [24]:
# Map NCP to the 3 survey categories: [1, 1.5] -> 1, (1.5, 2.5] -> 2, (2.5, 4] -> 3
def NCP_round(value):
    if 1 <= value <= 1.5:
        return 1
    elif 1.5 < value <= 2.5:
        return 2
    elif 2.5 < value <= 4:
        return 3
    else:
        return value  

data['NCP'] = data['NCP'].apply(NCP_round)


In [25]:
#Change the type of the column to object 
data['NCP'] = data['NCP'].astype('object')

In [26]:
data['NCP'].unique()

array([3, 1, 2], dtype=object)
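The intervals used by `NCP_round` can also be written as bin edges. The sketch below uses `pd.cut` on hypothetical boundary values (`ncp`) and is an alternative formulation, not the notebook's own code.

```python
import pandas as pd

# Hypothetical values sitting on and around the interval boundaries
ncp = pd.Series([1.0, 1.5, 1.51, 2.5, 2.51, 3.0, 4.0])

# Edges reproduce NCP_round: [1, 1.5] -> 1, (1.5, 2.5] -> 2, (2.5, 4] -> 3
# include_lowest=True keeps the value 1.0 inside the first bin
codes = pd.cut(ncp, bins=[1, 1.5, 2.5, 4], labels=[1, 2, 3], include_lowest=True)

print(codes.tolist())
```

Unlike `NCP_round`, which returns out-of-range values unchanged, `pd.cut` turns values outside the 1 to 4 range into missing values, which makes unexpected inputs easier to spot.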

***2. Frequency of consumption of vegetables (FCVC). FCVC is a categorical column; the survey offers 3 categories***


***Do you usually eat vegetables in your meals?***

***Never***

***Sometimes***

***Always***

In [27]:
data['FCVC'].dtypes

dtype('float64')

In [28]:
data['FCVC'].min()

1.0

In [29]:
data['FCVC'].max()

3.0

In [30]:
# Frequency of consumption of vegetables
data['FCVC'].unique()

array([2.      , 3.      , 1.      , 2.450218, 2.880161, 2.00876 ,
       2.596579, 2.591439, 2.392665, 1.123939, 2.027574, 2.658112,
       2.88626 , 2.714447, 2.750715, 1.4925  , 2.205439, 2.059138,
       2.310423, 2.823179, 2.052932, 2.596364, 2.767731, 2.815157,
       2.737762, 2.568063, 2.524428, 2.971574, 1.0816  , 1.270448,
       1.344854, 2.959658, 2.725282, 2.844607, 2.44004 , 2.432302,
       2.592247, 2.449267, 2.929889, 2.015258, 1.031149, 1.592183,
       1.21498 , 1.522001, 2.703436, 2.362918, 2.14084 , 2.5596  ,
       2.336044, 1.813234, 2.724285, 2.71897 , 1.133844, 1.757466,
       2.979383, 2.204914, 2.927218, 2.88853 , 2.890535, 2.530066,
       2.241606, 1.003566, 2.652779, 2.897899, 2.483979, 2.945967,
       2.478891, 2.784464, 1.005578, 2.938031, 2.842102, 1.889199,
       2.943749, 2.33998 , 1.950742, 2.277436, 2.371338, 2.984425,
       2.977018, 2.663421, 2.753752, 2.318355, 2.594653, 2.886157,
       2.967853, 2.619835, 1.053534, 2.530233, 2.8813  , 2.824

***The FCVC column needs to be corrected***

In [31]:
# Round to the nearest category: [1, 1.5] -> 1, (1.5, 2.5] -> 2, (2.5, 3] -> 3
def rounding(value):
    if 1 <= value <= 1.5:
        return 1
    elif 1.5 < value <= 2.5:
        return 2
    elif 2.5 < value <= 3:
        return 3
    else:
        return value 

data['FCVC'] = data['FCVC'].apply(rounding)

In [32]:
#Change the type of the column to object 
data['FCVC'] = data['FCVC'].astype('object')

In [33]:
data['FCVC'].unique()

array([2, 3, 1], dtype=object)
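The `rounding` function sends exact halves downward, which differs from the built-in rounding of pandas. A small sketch comparing the two on hypothetical values (`fcvc`, `half_down` and `half_even` are illustrative names):

```python
import numpy as np
import pandas as pd

# Hypothetical values, including the exact halves 1.5 and 2.5
fcvc = pd.Series([1.0, 1.4925, 1.5, 2.5, 2.596579, 3.0])

# rounding() sends exact halves downward (1.5 -> 1, 2.5 -> 2);
# ceil(x - 0.5) reproduces that for values in the 1..3 range
half_down = np.ceil(fcvc - 0.5).astype(int)

# Series.round() sends exact halves to the nearest even integer (1.5 -> 2, 2.5 -> 2)
half_even = fcvc.round().astype(int)

print(half_down.tolist())
print(half_even.tolist())
```

The two results differ only at 1.5 in this example, so the choice matters only for values that sit exactly on a half.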

***3. Consumption of water daily (CH2O)***

***How much water do you drink daily?***

***Less than a liter***

***Between 1 and 2 L***

***More than 2 L***

In [34]:
data['CH2O'].dtypes

dtype('float64')

In [35]:
#Consumption of water daily 
data['CH2O'].min()

1.0

In [36]:
data['CH2O'].max()

3.0

In [37]:
data['CH2O'].unique()

array([2.      , 3.      , 1.      , ..., 2.054193, 2.852339, 2.863513])

***CH2O needs to be corrected***

In [38]:
data['CH2O'] = data['CH2O'].apply(rounding)

In [39]:
#Change the type of the column to object 
data['CH2O'] = data['CH2O'].astype('object')

In [40]:
data['CH2O'].unique()

array([2, 3, 1], dtype=object)

***4-Physical activity frequency (FAF)***

***How often do you have physical activity?***

***I do not have***

***1 or 2 days***

***2 or 4 days***

***4 or 5 days***

In [41]:
data['FAF'].min()

0.0

In [42]:
data['FAF'].max()

3.0

In [43]:
#Physical activity frequency
data['FAF'].unique()

array([0.      , 3.      , 2.      , ..., 1.414209, 1.139107, 1.026452])

***FAF needs to be corrected***

In [44]:
def FAF_round(value):
    if 0 <= value < 1:
        return 0
    elif 1 <= value <= 1.5:
        return 1
    elif 1.5 < value <= 2:
        return 2
    elif 2 < value <= 3:
        return 3

data['FAF'] = data['FAF'].apply(FAF_round)

In [45]:
#Change the type of the column to object 
data['FAF'] = data['FAF'].astype('object')

In [46]:
data['FAF'].unique()

array([0, 3, 2, 1], dtype=object)
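The intervals in `FAF_round` mix open and closed ends, so they can be spelled out explicitly with `np.select`. This is a sketch on hypothetical boundary values; the `default=-1` sentinel is an assumption for illustration, since `FAF_round` itself returns `None` for values outside 0 to 3.

```python
import numpy as np
import pandas as pd

# Hypothetical values sitting on and around the interval boundaries
faf = pd.Series([0.0, 0.99, 1.0, 1.5, 1.51, 2.0, 2.01, 3.0])

# Same intervals as FAF_round: [0, 1) -> 0, [1, 1.5] -> 1, (1.5, 2] -> 2, (2, 3] -> 3
conditions = [
    (faf >= 0) & (faf < 1),
    (faf >= 1) & (faf <= 1.5),
    (faf > 1.5) & (faf <= 2),
    (faf > 2) & (faf <= 3),
]

# default=-1 is a hypothetical sentinel for out-of-range values
codes = pd.Series(np.select(conditions, [0, 1, 2, 3], default=-1), index=faf.index)

print(codes.tolist())
```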

***5-Time using technology devices***

***How much time do you use technological devices such as cell phone, videogames, television, computer and others?***

***0–2 hours***

***3–5 hours***

***More than 5 hours***

In [47]:
#Time using technology devices 
data['TUE'].min()

0.0

In [48]:
data['TUE'].max()

2.0

In [49]:
data['TUE'].unique()

array([1.      , 0.      , 2.      , ..., 0.646288, 0.586035, 0.714137])

***TUE needs to be corrected***

In [50]:
def TUE_rounding(value):
    if 0 <= value < 1:
        return 0
    elif 1 <= value <= 1.5:
        return 1
    elif 1.5 < value <= 2:
        return 2
    else:
        return value 

data['TUE'] = data['TUE'].apply(TUE_rounding)

In [51]:
#Change the type of the column to object 
data['TUE'] = data['TUE'].astype('object')

In [52]:
data['TUE'].unique()

array([1, 0, 2], dtype=object)

***6-Transportation used MTRANS***

***Which transportation do you usually use?***

***Automobile***

***Motorbike***

***Bike***

***Public Transportation***

***Walking***

In [53]:
#Transportation used
data['MTRANS'].unique()

array(['Public_Transportation', 'Walking', 'Automobile', 'Motorbike',
       'Bike'], dtype=object)

***7-What is your gender?***


***Female***


***Male***

In [54]:
data['Gender'].unique()

array(['Female', 'Male'], dtype=object)

In [55]:
print('The Gender column does not have errors')

The Gender column does not have errors


***8-Has a family member suffered or suffers from overweight?***

***Yes***

***No***

In [56]:
data['family_history_with_overweight'].unique()

array(['yes', 'no'], dtype=object)

In [57]:
print('The family_history_with_overweight column does not have errors')

The family_history_with_overweight column does not have errors


***9-Frequent consumption of high caloric food FAVC***

***Yes***

***No***

In [58]:
#frequent consumption of high caloric food 
data['FAVC'].unique()

array(['no', 'yes'], dtype=object)

In [59]:
print('FAVC is clean')

FAVC is clean


***10-Consumption of food between meals CAEC***

***Do you eat any food between meals?***

***No***

***Sometimes***

***Frequently***

***Always***

In [60]:
#Consumption of food between meals 
data['CAEC'].unique()

array(['Sometimes', 'Frequently', 'Always', 'no'], dtype=object)

In [61]:
print('The column CAEC is clean')

The column CAEC is clean


***11-Smoke***

***Do you smoke?***

***Yes***

***No***

In [62]:
data['SMOKE'].unique()

array(['no', 'yes'], dtype=object)

In [63]:
print('SMOKE column is clean')

SMOKE column is clean


***12-Calories consumption monitoring (SCC)***

***Do you monitor the calories you eat daily?***

***Yes***

***No***

In [64]:
#Calories consumption monitoring
data['SCC'].unique()

array(['no', 'yes'], dtype=object)

In [65]:
print('SCC column is clean')

SCC column is clean


***13-Consumption of alcohol (CALC)***

***How often do you drink alcohol?***

***I do not drink***

***Sometimes***

***Frequently***

***Always***

In [66]:
#Consumption of alcohol
data['CALC'].unique()

array(['no', 'Sometimes', 'Frequently', 'Always'], dtype=object)

In [67]:
print('CALC is a clean column')

CALC is a clean column
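The per-column checks above compare the unique values with the survey's answer options by eye. A minimal sketch of doing the same comparison in one loop, assuming hypothetical toy data with one deliberately invalid value (`expected`, `toy` and `unexpected` are illustrative names):

```python
import pandas as pd

# Allowed answers as they appear in the data (note the lowercase 'no')
expected = {
    "CAEC": {"no", "Sometimes", "Frequently", "Always"},
    "SMOKE": {"yes", "no"},
}

# Hypothetical toy data; 'maybe' is a deliberately invalid value
toy = pd.DataFrame({
    "CAEC": ["Sometimes", "no", "Always"],
    "SMOKE": ["no", "yes", "maybe"],
})

# Collect, per column, any values outside the allowed set
unexpected = {
    col: sorted(set(toy[col].unique()) - allowed)
    for col, allowed in expected.items()
}

print(unexpected)
```

An empty list for a column means every value matches one of the allowed answers.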


***14-Height***

***What is your height?***

***Numeric value in meters***

In [68]:
data['Height'].min()

1.45

In [69]:
data['Height'].max()

1.98

In [70]:
data['Height'].dtypes

dtype('float64')

In [71]:
print('Height is entered in meters, has no issues, and does not need to be converted')

Height is entered in meters, has no issues, and does not need to be converted


***15-Weight***

***What is your weight?***

***Numeric value in kilograms***

In [72]:
data['Weight'].max()

173.0

In [73]:
data['Weight'].min()

39.0

In [74]:
data['Weight'].dtypes

dtype('float64')

In [75]:
print('The Weight column does not have any issues')

The Weight column does not have any issues


***16-Age***

***What is your age?***

***Numeric value***

In [76]:
data['Age'].dtypes

dtype('float64')

In [77]:
data['Age'].min()

14.0

In [78]:
data['Age'].max()

61.0

In [79]:
print('The Age column is clean and it matches the paper that mentions the ages of the participants are between 14 and 61')

The Age column is clean and it matches the paper that mentions the ages of the participants are between 14 and 61


***The columns MTRANS, Gender, Age, Height, Weight, family_history_with_overweight, FAVC, CAEC, SMOKE, SCC, and CALC are clean and do not have any issues.***

In [80]:
data.head()

Unnamed: 0,Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,BMI,Classes
0,Female,21.0,1.62,64.0,yes,no,2,3,Sometimes,no,2,no,0,1,no,Public_Transportation,24.39,Normal_Weight
1,Female,21.0,1.52,56.0,yes,no,3,3,Sometimes,yes,3,yes,3,0,Sometimes,Public_Transportation,24.24,Normal_Weight
2,Male,23.0,1.8,77.0,yes,no,2,3,Sometimes,no,2,no,2,1,Frequently,Public_Transportation,23.77,Normal_Weight
3,Male,27.0,1.8,87.0,no,no,3,3,Sometimes,no,2,no,2,0,Frequently,Walking,26.85,Overweight
4,Male,22.0,1.78,89.8,no,no,2,1,Sometimes,no,2,no,0,0,Sometimes,Public_Transportation,28.34,Overweight


In [81]:
# Save the new DataFrame to a CSV file
data.to_csv('modified_data.csv', index=False)

In [82]:
df=pd.read_csv('modified_data.csv')
df.head()

Unnamed: 0,Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,BMI,Classes
0,Female,21.0,1.62,64.0,yes,no,2,3,Sometimes,no,2,no,0,1,no,Public_Transportation,24.39,Normal_Weight
1,Female,21.0,1.52,56.0,yes,no,3,3,Sometimes,yes,3,yes,3,0,Sometimes,Public_Transportation,24.24,Normal_Weight
2,Male,23.0,1.8,77.0,yes,no,2,3,Sometimes,no,2,no,2,1,Frequently,Public_Transportation,23.77,Normal_Weight
3,Male,27.0,1.8,87.0,no,no,3,3,Sometimes,no,2,no,2,0,Frequently,Walking,26.85,Overweight
4,Male,22.0,1.78,89.8,no,no,2,1,Sometimes,no,2,no,0,0,Sometimes,Public_Transportation,28.34,Overweight


In [83]:
# Check that there are no errors in the modified data
# Group by the 'Classes' column and get the maximum BMI for each class
max_bmi_by_category = df.groupby('Classes')['BMI'].max()

print(max_bmi_by_category)

# Group by the 'Classes' column and get the minimum BMI for each class
min_bmi_by_category = df.groupby('Classes')['BMI'].min()

print(min_bmi_by_category)

Classes
Insufficient_Weight    18.49
Normal_Weight          24.95
Obesity_Type_I         34.95
Obesity_Type_II        39.94
Obesity_Type_III       50.81
Overweight             29.99
Name: BMI, dtype: float64
Classes
Insufficient_Weight    13.00
Normal_Weight          18.50
Obesity_Type_I         30.02
Obesity_Type_II        35.00
Obesity_Type_III       40.01
Overweight             25.00
Name: BMI, dtype: float64
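The visual check above can also be turned into a row-level test that every BMI lies inside the range of its class. This is a sketch on hypothetical toy rows with only three classes; `who_ranges`, `toy` and `in_range` are illustrative names, and the half-open ranges follow the thresholds used in `classify_bmi`.

```python
import pandas as pd

# Half-open ranges [lower, upper) keyed by class name
who_ranges = {
    "Insufficient_Weight": (0.0, 18.5),
    "Normal_Weight": (18.5, 25.0),
    "Overweight": (25.0, 30.0),
}

# Hypothetical toy rows using extremes from the output above
toy = pd.DataFrame({
    "Classes": ["Insufficient_Weight", "Normal_Weight", "Normal_Weight", "Overweight"],
    "BMI": [18.49, 18.50, 24.95, 29.99],
})

# Look up the bounds for each row's class and test the BMI against them
lower = toy["Classes"].map(lambda c: who_ranges[c][0])
upper = toy["Classes"].map(lambda c: who_ranges[c][1])
in_range = (toy["BMI"] >= lower) & (toy["BMI"] < upper)

print(in_range.all())
```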


In [84]:
df['Classes'].unique()

array(['Normal_Weight', 'Overweight', 'Obesity_Type_I',
       'Insufficient_Weight', 'Obesity_Type_II', 'Obesity_Type_III'],
      dtype=object)

In [85]:
data['NCP'].unique()

array([3, 1, 2], dtype=object)

In [86]:
df['NCP'].unique()

array([3, 1, 2], dtype=int64)

In [87]:
df['FCVC'].unique()

array([2, 3, 1], dtype=int64)

In [88]:
df['CH2O'].unique()

array([2, 3, 1], dtype=int64)

In [89]:
df['FAF'].unique()

array([0, 3, 2, 1], dtype=int64)

In [90]:
df['TUE'].unique()


array([1, 0, 2], dtype=int64)
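The outputs above show `int64` for columns that were saved as `object`: a CSV file stores only text, so the dtype is inferred again on load. A minimal sketch of the round trip using an in-memory buffer, with one possible way to declare the column as an ordered categorical afterwards (`codes`, `reloaded`, `inferred_dtype` and `ncp_type` are hypothetical names; the ordered categorical is an assumption, not something the notebook does):

```python
import io

import pandas as pd

# Hypothetical frame with an object-typed code column, as in the notebook
codes = pd.DataFrame({"NCP": [3, 1, 2]}).astype({"NCP": "object"})

# Round trip through CSV text held in memory
buffer = io.StringIO()
codes.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The object dtype is not stored in the CSV; the column is inferred as integer
inferred_dtype = reloaded["NCP"].dtype
print(codes["NCP"].dtype, inferred_dtype)

# One option: declare the column as an ordered categorical after loading
ncp_type = pd.CategoricalDtype(categories=[1, 2, 3], ordered=True)
reloaded["NCP"] = reloaded["NCP"].astype(ncp_type)
```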

In [91]:
new_data=pd.read_csv('tableau_file.csv')
new_data.head()

Unnamed: 0,Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,BMI,Classes
0,Female,21.0,1.62,64.0,yes,no,Sometimes,More than three,Sometimes,no,Between 1 and 2 L,no,I do not have,3–5 hours,no,Public_Transportation,24.39,Normal_Weight
1,Female,21.0,1.52,56.0,yes,no,Always,More than three,Sometimes,yes,More than 2 L,yes,4 or 5 days,0–2 hours,Sometimes,Public_Transportation,24.24,Normal_Weight
2,Male,23.0,1.8,77.0,yes,no,Sometimes,More than three,Sometimes,no,Between 1 and 2 L,no,2 or 4 days,3–5 hours,Frequently,Public_Transportation,23.77,Normal_Weight
3,Male,27.0,1.8,87.0,no,no,Always,More than three,Sometimes,no,Between 1 and 2 L,no,2 or 4 days,0–2 hours,Frequently,Walking,26.85,Overweight
4,Male,22.0,1.78,89.8,no,no,Sometimes,Between 1 and 2,Sometimes,no,Between 1 and 2 L,no,I do not have,0–2 hours,Sometimes,Public_Transportation,28.34,Overweight
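The notebook does not show how `tableau_file.csv` was produced. Comparing its rows with `modified_data.csv` suggests that the integer codes were replaced by the survey answers. The sketch below shows one way this could be done with `Series.map`. The mappings are inferred from the rows displayed above, and the entries not visible in those rows are filled in by elimination, so the whole dictionary should be read as an assumption.

```python
import pandas as pd

# Mappings inferred by comparing the two tables above (assumption, not the author's code)
label_maps = {
    "FCVC": {1: "Never", 2: "Sometimes", 3: "Always"},
    "NCP": {1: "Between 1 and 2", 2: "Three", 3: "More than three"},
    "CH2O": {1: "Less than a liter", 2: "Between 1 and 2 L", 3: "More than 2 L"},
    "FAF": {0: "I do not have", 1: "1 or 2 days", 2: "2 or 4 days", 3: "4 or 5 days"},
    "TUE": {0: "0–2 hours", 1: "3–5 hours", 2: "More than 5 hours"},
}

# Hypothetical sample: rows 0, 1 and 4 of the coded table shown earlier
coded = pd.DataFrame({
    "FCVC": [2, 3, 2],
    "NCP": [3, 3, 1],
    "CH2O": [2, 3, 2],
    "FAF": [0, 3, 0],
    "TUE": [1, 0, 0],
})

# Replace each code with its label, column by column
labelled = coded.copy()
for col, mapping in label_maps.items():
    labelled[col] = labelled[col].map(mapping)

print(labelled)
```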


In [92]:
new_data['FCVC'].unique()

array(['Sometimes', 'Always', 'Never'], dtype=object)

In [93]:
new_data['FCVC'].value_counts()

Sometimes    1013
Always        996
Never         102
Name: FCVC, dtype: int64

In [94]:
new_data['NCP'].value_counts()

More than three    1619
Between 1 and 2     316
Three               176
Name: NCP, dtype: int64

In [95]:
new_data['FAVC'].value_counts()

yes    1866
no      245
Name: FAVC, dtype: int64

In [96]:
new_data['CH2O'].value_counts()

Between 1 and 2 L    1110
More than 2 L         516
Less than a liter     485
Name: CH2O, dtype: int64

In [97]:
new_data['CALC'].value_counts()

Sometimes     1401
no             639
Frequently      70
Always           1
Name: CALC, dtype: int64

In [98]:
new_data['SCC'].value_counts()

no     2015
yes      96
Name: SCC, dtype: int64

In [99]:
new_data['FAF'].value_counts()

I do not have    1011
1 or 2 days       485
2 or 4 days       422
4 or 5 days       193
Name: FAF, dtype: int64

In [100]:
new_data['MTRANS'].value_counts()

Public_Transportation    1580
Automobile                457
Walking                    56
Motorbike                  11
Bike                        7
Name: MTRANS, dtype: int64

In [101]:
new_data['SMOKE'].value_counts()

no     2067
yes      44
Name: SMOKE, dtype: int64

In [102]:
new_data['Gender'].value_counts()

Male      1068
Female    1043
Name: Gender, dtype: int64

In [103]:
new_data['family_history_with_overweight'].value_counts()

yes    1726
no      385
Name: family_history_with_overweight, dtype: int64

In [104]:
new_data['CAEC'].value_counts()

Sometimes     1765
Frequently     242
Always          53
no              51
Name: CAEC, dtype: int64