# Bellabeat: Case Study

## About the data

* The data used here is the Fitbit dataset available at https://www.kaggle.com/datasets/arashnic/fitbit

* The dataset is described as covering about 30 participants who responded to a survey and consented to share the data from their Fitbit devices.

* In practice the sample is small: the daily activity files contain 33 distinct participant `Id` values. That is enough to start an analysis, but a larger sample would be preferable. Access to Bellabeat's internal data would have helped a lot here.

* In addition to this dataset, I will also use some supplementary information:
  * *"Smartwatch unit shipment share worldwide in first quarter of 2021, by platform"*, by Counterpoint Research (via Statista)
  * *"Quarterly smartwatch unit shipment share worldwide from 2018 to 2022, by vendor"*, by Counterpoint Research (via Statista)

These two datasets are already prepared at the source, so they need no further processing.

## Data Cleaning and Transformation

In [1]:
# If any of these libraries is missing, install it by removing the hash (#) from the corresponding line before running the cell.
# !pip install numpy
# !pip install pandas
# !pip install summarytools

In [2]:
import numpy as np
import pandas as pd
import summarytools as dfs

daily_activity = pd.read_csv(r"""C:\Users\mathe\Desktop\Power BI\Bellabeat (Google Analytics)\datasets\Fitabase Data 4.12.16-5.12.16\dailyActivity_merged.csv""", sep=",")

In [3]:
dfs.dfSummary(daily_activity)

No,Variable,Stats / Values,Freqs / (% of Valid),Graph,Missing
1,Id [int64],Mean (sd) : 4855407369.3 (2424805475.7) min < med < max: 1503960366.0 < 4445114986.0 < 8877689391.0 IQR (CV) : 4642054065.0 (2.0),33 distinct values,,0 (0.0%)
2,ActivityDate [object],1. 4/12/2016 2. 4/14/2016 3. 4/15/2016 4. 4/13/2016 5. 4/23/2016 6. 4/29/2016 7. 4/28/2016 8. 4/26/2016 9. 4/25/2016 10. 4/24/2016 11. other,33 (3.5%) 33 (3.5%) 33 (3.5%) 33 (3.5%) 32 (3.4%) 32 (3.4%) 32 (3.4%) 32 (3.4%) 32 (3.4%) 32 (3.4%) 616 (65.5%),,0 (0.0%)
3,TotalSteps [int64],Mean (sd) : 7637.9 (5087.2) min < med < max: 0.0 < 7405.5 < 36019.0 IQR (CV) : 6937.2 (1.5),842 distinct values,,0 (0.0%)
4,TotalDistance [float64],Mean (sd) : 5.5 (3.9) min < med < max: 0.0 < 5.2 < 28.0 IQR (CV) : 5.1 (1.4),615 distinct values,,0 (0.0%)
5,TrackerDistance [float64],Mean (sd) : 5.5 (3.9) min < med < max: 0.0 < 5.2 < 28.0 IQR (CV) : 5.1 (1.4),613 distinct values,,0 (0.0%)
6,LoggedActivitiesDistance [float64],Mean (sd) : 0.1 (0.6) min < med < max: 0.0 < 0.0 < 4.9 IQR (CV) : 0.0 (0.2),19 distinct values,,0 (0.0%)
7,VeryActiveDistance [float64],Mean (sd) : 1.5 (2.7) min < med < max: 0.0 < 0.2 < 21.9 IQR (CV) : 2.1 (0.6),333 distinct values,,0 (0.0%)
8,ModeratelyActiveDistance [float64],Mean (sd) : 0.6 (0.9) min < med < max: 0.0 < 0.2 < 6.5 IQR (CV) : 0.8 (0.6),211 distinct values,,0 (0.0%)
9,LightActiveDistance [float64],Mean (sd) : 3.3 (2.0) min < med < max: 0.0 < 3.4 < 10.7 IQR (CV) : 2.8 (1.6),491 distinct values,,0 (0.0%)
10,SedentaryActiveDistance [float64],1. 0.0 2. 0.0099999997764825 3. 0.0199999995529652 4. 0.0299999993294477 5. 0.0500000007450581 6. 0.0700000002980232 7. 0.0399999991059303 8. 0.109999999403954 9. 0.100000001490116,858 (91.3%) 50 (5.3%) 21 (2.2%) 4 (0.4%) 3 (0.3%) 1 (0.1%) 1 (0.1%) 1 (0.1%) 1 (0.1%),,0 (0.0%)
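
The summary above reports a minimum of 0 for `TotalSteps`, which may indicate days on which the tracker was not worn. The following is a minimal sketch of how such rows could be flagged. The `toy_activity` frame and its values are made up for illustration, and treating zero-step days as non-wear days is an assumption, not something the dataset documents.

```python
import pandas as pd

# Hypothetical stand-in for daily_activity; the values are invented.
toy_activity = pd.DataFrame({
    "Id": [1, 1, 2, 2],
    "ActivityDate": ["4/12/2016", "4/13/2016", "4/12/2016", "4/13/2016"],
    "TotalSteps": [8000, 0, 0, 12000],
})

# Assumption: a day with zero recorded steps means the device was not worn.
not_worn = toy_activity["TotalSteps"] == 0

# Keep only the days that look like real usage.
worn_days = toy_activity.loc[~not_worn].reset_index(drop=True)

print(f"Flagged {not_worn.sum()} of {len(toy_activity)} rows as possible non-wear days")
```

Whether to drop or merely flag these rows depends on the question being asked: dropping them raises the average step count, so the choice is worth stating alongside any result.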


In [4]:
daily_calories = pd.read_csv(r"""C:\Users\mathe\Desktop\Power BI\Bellabeat (Google Analytics)\datasets\Fitabase Data 4.12.16-5.12.16\dailyCalories_merged.csv""")
dfs.dfSummary(daily_calories)

No,Variable,Stats / Values,Freqs / (% of Valid),Graph,Missing
1,Id [int64],Mean (sd) : 4855407369.3 (2424805475.7) min < med < max: 1503960366.0 < 4445114986.0 < 8877689391.0 IQR (CV) : 4642054065.0 (2.0),33 distinct values,,0 (0.0%)
2,ActivityDay [object],1. 4/12/2016 2. 4/14/2016 3. 4/15/2016 4. 4/13/2016 5. 4/23/2016 6. 4/29/2016 7. 4/28/2016 8. 4/26/2016 9. 4/25/2016 10. 4/24/2016 11. other,33 (3.5%) 33 (3.5%) 33 (3.5%) 33 (3.5%) 32 (3.4%) 32 (3.4%) 32 (3.4%) 32 (3.4%) 32 (3.4%) 32 (3.4%) 616 (65.5%),,0 (0.0%)
3,Calories [int64],Mean (sd) : 2303.6 (718.2) min < med < max: 0.0 < 2134.0 < 4900.0 IQR (CV) : 964.8 (3.2),734 distinct values,,0 (0.0%)


In [5]:
daily_intensities = pd.read_csv(r"""C:\Users\mathe\Desktop\Power BI\Bellabeat (Google Analytics)\datasets\Fitabase Data 4.12.16-5.12.16\dailyIntensities_merged.csv""")
dfs.dfSummary(daily_intensities)

No,Variable,Stats / Values,Freqs / (% of Valid),Graph,Missing
1,Id [int64],Mean (sd) : 4855407369.3 (2424805475.7) min < med < max: 1503960366.0 < 4445114986.0 < 8877689391.0 IQR (CV) : 4642054065.0 (2.0),33 distinct values,,0 (0.0%)
2,ActivityDay [object],1. 4/12/2016 2. 4/14/2016 3. 4/15/2016 4. 4/13/2016 5. 4/23/2016 6. 4/29/2016 7. 4/28/2016 8. 4/26/2016 9. 4/25/2016 10. 4/24/2016 11. other,33 (3.5%) 33 (3.5%) 33 (3.5%) 33 (3.5%) 32 (3.4%) 32 (3.4%) 32 (3.4%) 32 (3.4%) 32 (3.4%) 32 (3.4%) 616 (65.5%),,0 (0.0%)
3,SedentaryMinutes [int64],Mean (sd) : 991.2 (301.3) min < med < max: 0.0 < 1057.5 < 1440.0 IQR (CV) : 499.8 (3.3),549 distinct values,,0 (0.0%)
4,LightlyActiveMinutes [int64],Mean (sd) : 192.8 (109.2) min < med < max: 0.0 < 199.0 < 518.0 IQR (CV) : 137.0 (1.8),335 distinct values,,0 (0.0%)
5,FairlyActiveMinutes [int64],Mean (sd) : 13.6 (20.0) min < med < max: 0.0 < 6.0 < 143.0 IQR (CV) : 19.0 (0.7),81 distinct values,,0 (0.0%)
6,VeryActiveMinutes [int64],Mean (sd) : 21.2 (32.8) min < med < max: 0.0 < 4.0 < 210.0 IQR (CV) : 32.0 (0.6),122 distinct values,,0 (0.0%)
7,SedentaryActiveDistance [float64],1. 0.0 2. 0.0099999997764825 3. 0.0199999995529652 4. 0.0299999993294477 5. 0.0500000007450581 6. 0.0700000002980232 7. 0.0399999991059303 8. 0.109999999403954 9. 0.100000001490116,858 (91.3%) 50 (5.3%) 21 (2.2%) 4 (0.4%) 3 (0.3%) 1 (0.1%) 1 (0.1%) 1 (0.1%) 1 (0.1%),,0 (0.0%)
8,LightActiveDistance [float64],Mean (sd) : 3.3 (2.0) min < med < max: 0.0 < 3.4 < 10.7 IQR (CV) : 2.8 (1.6),491 distinct values,,0 (0.0%)
9,ModeratelyActiveDistance [float64],Mean (sd) : 0.6 (0.9) min < med < max: 0.0 < 0.2 < 6.5 IQR (CV) : 0.8 (0.6),211 distinct values,,0 (0.0%)
10,VeryActiveDistance [float64],Mean (sd) : 1.5 (2.7) min < med < max: 0.0 < 0.2 < 21.9 IQR (CV) : 2.1 (0.6),333 distinct values,,0 (0.0%)


In [6]:
daily_steps = pd.read_csv(r"""C:\Users\mathe\Desktop\Power BI\Bellabeat (Google Analytics)\datasets\Fitabase Data 4.12.16-5.12.16\dailySteps_merged.csv""")
dfs.dfSummary(daily_steps)

No,Variable,Stats / Values,Freqs / (% of Valid),Graph,Missing
1,Id [int64],Mean (sd) : 4855407369.3 (2424805475.7) min < med < max: 1503960366.0 < 4445114986.0 < 8877689391.0 IQR (CV) : 4642054065.0 (2.0),33 distinct values,,0 (0.0%)
2,ActivityDay [object],1. 4/12/2016 2. 4/14/2016 3. 4/15/2016 4. 4/13/2016 5. 4/23/2016 6. 4/29/2016 7. 4/28/2016 8. 4/26/2016 9. 4/25/2016 10. 4/24/2016 11. other,33 (3.5%) 33 (3.5%) 33 (3.5%) 33 (3.5%) 32 (3.4%) 32 (3.4%) 32 (3.4%) 32 (3.4%) 32 (3.4%) 32 (3.4%) 616 (65.5%),,0 (0.0%)
3,StepTotal [int64],Mean (sd) : 7637.9 (5087.2) min < med < max: 0.0 < 7405.5 < 36019.0 IQR (CV) : 6937.2 (1.5),842 distinct values,,0 (0.0%)
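
The summary statistics for `StepTotal` match those of `TotalSteps` in `daily_activity`, which suggests that `dailySteps_merged.csv` repeats information already present in the activity file. A quick way to check this, sketched here on made-up toy frames (`toy_activity` and `toy_steps` are hypothetical names), is an outer merge on participant and date with an indicator column:

```python
import pandas as pd

# Hypothetical stand-ins for daily_activity and daily_steps; values are invented.
toy_activity = pd.DataFrame({
    "Id": [1, 1, 2],
    "ActivityDate": ["4/12/2016", "4/13/2016", "4/12/2016"],
    "TotalSteps": [8000, 0, 12000],
})
toy_steps = pd.DataFrame({
    "Id": [1, 1, 2],
    "ActivityDay": ["4/12/2016", "4/13/2016", "4/12/2016"],
    "StepTotal": [8000, 0, 12000],
})

# The outer merge keeps unmatched rows from either side; `_merge` records their origin.
merged = toy_activity.merge(
    toy_steps,
    left_on=["Id", "ActivityDate"],
    right_on=["Id", "ActivityDay"],
    how="outer",
    indicator=True,
)

# True only if every (Id, date) pair exists in both tables.
all_matched = bool((merged["_merge"] == "both").all())
# True only if the step counts agree row by row.
steps_agree = bool((merged["TotalSteps"] == merged["StepTotal"]).all())

print(all_matched, steps_agree)
```

If both checks hold on the real files, the smaller table can be left out of the analysis without losing information. The same pattern could be applied to the calories and intensities files.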


In [7]:
daily_sleep = pd.read_csv(r"""C:\Users\mathe\Desktop\Power BI\Bellabeat (Google Analytics)\datasets\Fitabase Data 4.12.16-5.12.16\sleepDay_merged.csv""")
dfs.dfSummary(daily_sleep)

No,Variable,Stats / Values,Freqs / (% of Valid),Graph,Missing
1,Id [int64],Mean (sd) : 5000979403.2 (2060360173.7) min < med < max: 1503960366.0 < 4702921684.0 < 8792009665.0 IQR (CV) : 2984847353.0 (2.4),24 distinct values,,0 (0.0%)
2,SleepDay [object],1. 4/15/2016 12:00:00 AM 2. 5/1/2016 12:00:00 AM 3. 4/28/2016 12:00:00 AM 4. 4/30/2016 12:00:00 AM 5. 4/20/2016 12:00:00 AM 6. 4/21/2016 12:00:00 AM 7. 4/23/2016 12:00:00 AM 8. 4/29/2016 12:00:00 AM 9. 4/26/2016 12:00:00 AM 10. 5/8/2016 12:00:00 AM 11. other,17 (4.1%) 16 (3.9%) 16 (3.9%) 15 (3.6%) 15 (3.6%) 15 (3.6%) 15 (3.6%) 15 (3.6%) 14 (3.4%) 14 (3.4%) 261 (63.2%),,0 (0.0%)
3,TotalSleepRecords [int64],Mean (sd) : 1.1 (0.3) min < med < max: 1.0 < 1.0 < 3.0 IQR (CV) : 0.0 (3.2),3 distinct values,,0 (0.0%)
4,TotalMinutesAsleep [int64],Mean (sd) : 419.5 (118.3) min < med < max: 58.0 < 433.0 < 796.0 IQR (CV) : 129.0 (3.5),256 distinct values,,0 (0.0%)
5,TotalTimeInBed [int64],Mean (sd) : 458.6 (127.1) min < med < max: 61.0 < 463.0 < 961.0 IQR (CV) : 123.0 (3.6),242 distinct values,,0 (0.0%)
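
The sleep table reports only 24 distinct `Id` values, compared with 33 in the activity tables, so some participants have no sleep records at all. A minimal sketch for listing them follows. The toy frames are made-up stand-ins for `daily_activity` and `daily_sleep`.

```python
import pandas as pd

# Hypothetical stand-ins; the Ids and values are invented.
toy_activity = pd.DataFrame({
    "Id": [1, 1, 2, 3],
    "TotalSteps": [8000, 9000, 12000, 4000],
})
toy_sleep = pd.DataFrame({
    "Id": [1, 1, 3],
    "TotalMinutesAsleep": [420, 390, 450],
})

activity_ids = set(toy_activity["Id"])
sleep_ids = set(toy_sleep["Id"])

# Participants who appear in the activity data but never in the sleep data.
missing_sleep = sorted(activity_ids - sleep_ids)

print(f"{len(sleep_ids)} of {len(activity_ids)} participants have sleep records")
print("No sleep records for:", missing_sleep)
```

Any analysis that joins sleep with activity is effectively restricted to the smaller group, which reduces an already small sample further.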


In [9]:
## Converting dates

daily_intensities["ActivityDay"] = pd.to_datetime(daily_intensities["ActivityDay"]).dt.date
daily_activity["ActivityDate"] = pd.to_datetime(daily_activity["ActivityDate"]).dt.date
daily_calories["ActivityDay"] = pd.to_datetime(daily_calories["ActivityDay"]).dt.date
daily_steps["ActivityDay"] = pd.to_datetime(daily_steps["ActivityDay"]).dt.date
daily_sleep["SleepDay"] = pd.to_datetime(daily_sleep["SleepDay"]).dt.date
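
One side effect of `.dt.date` is that the column is stored as plain Python `date` objects, which is why the summaries still list it as `object`. A possible alternative, shown as a sketch on made-up toy frames, is to pass an explicit `format` and keep the `datetime64` dtype. The format strings are assumptions inferred from the values shown in the summaries above.

```python
import pandas as pd

# Hypothetical stand-ins for daily_steps and daily_sleep.
toy_steps = pd.DataFrame({"ActivityDay": ["4/12/2016", "5/1/2016"]})
toy_sleep = pd.DataFrame({"SleepDay": ["4/12/2016 12:00:00 AM", "5/1/2016 12:00:00 AM"]})

# Month/day/year, as the daily files appear to use.
toy_steps["ActivityDay"] = pd.to_datetime(toy_steps["ActivityDay"], format="%m/%d/%Y")

# The sleep file also carries a 12-hour clock time; normalize() resets it to midnight.
toy_sleep["SleepDay"] = pd.to_datetime(
    toy_sleep["SleepDay"], format="%m/%d/%Y %I:%M:%S %p"
).dt.normalize()

print(toy_steps.dtypes)
print(toy_sleep.dtypes)
```

Keeping the `datetime64` dtype leaves the `.dt` accessor available later (for example `.dt.day_name()`), and an explicit format makes parsing fail loudly if a file uses a different layout.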

In [11]:
dfs.dfSummary(daily_steps)

No,Variable,Stats / Values,Freqs / (% of Valid),Graph,Missing
1,Id [int64],Mean (sd) : 4855407369.3 (2424805475.7) min < med < max: 1503960366.0 < 4445114986.0 < 8877689391.0 IQR (CV) : 4642054065.0 (2.0),33 distinct values,,0 (0.0%)
2,ActivityDay [object],1. 2016-04-12 2. 2016-04-14 3. 2016-04-15 4. 2016-04-13 5. 2016-04-23 6. 2016-04-29 7. 2016-04-28 8. 2016-04-26 9. 2016-04-25 10. 2016-04-24 11. other,33 (3.5%) 33 (3.5%) 33 (3.5%) 33 (3.5%) 32 (3.4%) 32 (3.4%) 32 (3.4%) 32 (3.4%) 32 (3.4%) 32 (3.4%) 616 (65.5%),,0 (0.0%)
3,StepTotal [int64],Mean (sd) : 7637.9 (5087.2) min < med < max: 0.0 < 7405.5 < 36019.0 IQR (CV) : 6937.2 (1.5),842 distinct values,,0 (0.0%)


In [12]:
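# Save the cleaned tables. Every file gets a .csv extension, and index=False
# keeps the DataFrame row index from being written as an extra unnamed column.
daily_intensities.to_csv("CLEAN_daily_intensities.csv", index=False)
daily_activity.to_csv("CLEAN_daily_activity.csv", index=False)
daily_calories.to_csv("CLEAN_daily_calories.csv", index=False)
daily_steps.to_csv("CLEAN_daily_steps.csv", index=False)
daily_sleep.to_csv("CLEAN_daily_sleep.csv", index=False)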
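
To confirm that an exported file can be read back cleanly, a round trip can be sketched as follows. It uses an in-memory buffer and a made-up toy frame instead of the real files. By default `to_csv` also writes the row index as an unnamed first column, so the sketch passes `index=False`.

```python
import io

import pandas as pd

# Hypothetical stand-in for the cleaned daily_steps table.
toy_steps = pd.DataFrame({
    "Id": [1, 2],
    "ActivityDay": pd.to_datetime(["2016-04-12", "2016-04-13"]).date,
    "StepTotal": [8000, 12000],
})

# Write to an in-memory buffer instead of disk so the example leaves no files behind.
buffer = io.StringIO()
toy_steps.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the date column as datetime64 on load.
restored = pd.read_csv(buffer, parse_dates=["ActivityDay"])

print(restored.dtypes)
```

CSV stores no type information, so the date column comes back as text unless `parse_dates` (or a later `pd.to_datetime` call) is used when loading.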