📘 CP1 IOT - Energy Consumption Analysis
This Jupyter notebook organizes the complete solutions to the CP1 IoT tasks. It uses two datasets to demonstrate data science skills: Individual Household Electric Power Consumption and Appliances Energy Prediction.

The main steps and techniques used in this project are:

Data Preparation and Cleaning: Handles missing data, converts formats, and prepares the data for analysis.

Analysis and Visualization: Explores the data with plots, statistics, and time series decomposition to understand consumption patterns.

Predictive Modeling: Applies machine learning models, such as linear regression and random forest, to predict energy consumption.

Clustering: Uses K-Means to group and identify different consumption profiles.

Visual Tool Analysis: Includes answers and conclusions from exercises done with the Orange Data Mining tool.



In [None]:
#CP1 IOT

#PART 1 - Initial exercises with Individual Household Electric Power Consumption

## Question 1

In [None]:

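import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

import warnings
warnings.filterwarnings('ignore')

# Load the household dataset used as `df` in the questions below.
# Assumption: the UCI file is saved locally as "household_power_consumption.txt".
# The file is semicolon-separated and marks missing values with "?".
df = pd.read_csv("household_power_consumption.txt", sep=";", na_values="?", low_memory=False)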

## Question 2

In [None]:

# Global_active_power is the active power drawn by the household (the power that does useful work in the appliances), and Global_reactive_power is the reactive power, associated with magnetic fields
# (as in motors and transformers). It does no useful work, but it still circulates in the network.
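The relationship between the two quantities can be made concrete with the standard power triangle: apparent power is S = sqrt(P² + Q²) and the power factor is P / S. The sketch below is only an illustration; the function names and the sample readings are made up, and it assumes P and Q are expressed in matching units (kW and kvar).

```python
import math


def apparent_power(p_kw: float, q_kvar: float) -> float:
    """Apparent power S (kVA) from active power P (kW) and reactive power Q (kvar)."""
    return math.hypot(p_kw, q_kvar)


def power_factor(p_kw: float, q_kvar: float) -> float:
    """Fraction of the apparent power that does useful work (P / S)."""
    s = apparent_power(p_kw, q_kvar)
    return p_kw / s if s > 0 else float("nan")


# Made-up reading: 3 kW of active power and 4 kvar of reactive power.
s = apparent_power(3.0, 4.0)
pf = power_factor(3.0, 4.0)
print(f"S = {s} kVA, power factor = {pf}")
```

A power factor close to 1 means almost all of the circulating power is doing useful work.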

## Question 3

In [None]:

missing = df.isnull().sum()
print("Missing values per column:")
print(missing)
print("Total missing values:", missing.sum())
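One detail worth checking here: the household file marks missing readings with "?", so `isnull()` only counts them if the file was read with `na_values="?"`. The sketch below uses a tiny made-up excerpt in the same layout to show the difference; the variable names are illustrative.

```python
import io

import pandas as pd

# Tiny made-up excerpt in the same layout as the household file.
raw = (
    "Date;Time;Global_active_power\n"
    "16/12/2006;17:24:00;4.216\n"
    "16/12/2006;17:25:00;?\n"
)

# Without na_values, "?" is kept as ordinary text and is not counted as missing.
naive = pd.read_csv(io.StringIO(raw), sep=";")
# With na_values="?", the same cell becomes NaN.
aware = pd.read_csv(io.StringIO(raw), sep=";", na_values="?")

naive_missing = int(naive["Global_active_power"].isnull().sum())
aware_missing = int(aware["Global_active_power"].isnull().sum())
print("Missing without na_values:", naive_missing)
print("Missing with na_values='?':", aware_missing)
```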

## Question 4

In [None]:

df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")
df["Weekday"] = df["Date"].dt.day_name()

print(df[["Date", "Weekday"]].head())

## Question 5

In [None]:

# .copy() avoids pandas' SettingWithCopyWarning when the column is converted below.
df_2007 = df[df["Date"].dt.year == 2007].copy()

df_2007["Global_active_power"] = pd.to_numeric(df_2007["Global_active_power"], errors="coerce")

daily_mean = df_2007.groupby(df_2007["Date"].dt.date)["Global_active_power"].mean()

print("Daily mean consumption in 2007:")
print(daily_mean.head())

## Question 6

In [None]:
import matplotlib.pyplot as plt

one_day = df[df["Date"] == "2007-01-10"].copy()

one_day["Global_active_power"] = pd.to_numeric(one_day["Global_active_power"], errors="coerce")

plt.figure(figsize=(12,5))
plt.plot(one_day["Global_active_power"])
plt.title("Global Active Power variation on 2007-01-10")
plt.xlabel("Records over the day")
plt.ylabel("Global Active Power (kW)")
plt.show()

## Question 7

In [None]:

df["Voltage"] = pd.to_numeric(df["Voltage"], errors="coerce")

plt.figure(figsize=(8,5))
plt.hist(df["Voltage"].dropna(), bins=50, color="skyblue", edgecolor="black")
plt.title("Distribution of the Voltage variable")
plt.xlabel("Voltage (V)")
plt.ylabel("Frequency")
plt.show()

## Question 8

In [None]:

df["Global_active_power"] = pd.to_numeric(df["Global_active_power"], errors="coerce")

monthly_mean = df.groupby(df["Date"].dt.to_period("M"))["Global_active_power"].mean()

print("Mean consumption per month:")
print(monthly_mean)

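## Question 9

In [None]:

# Each row is a one-minute average in kW, so the daily sum divided by 60 is the day's energy in kWh.
daily_sum = df.groupby(df["Date"].dt.date)["Global_active_power"].sum() / 60

max_day = daily_sum.idxmax()
max_value = daily_sum.max()

print(f"Day with the highest consumption: {max_day} with {max_value:.2f} kWh")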

## Question 10

In [None]:

df["is_weekend"] = df["Weekday"].isin(["Saturday", "Sunday"])

week_comparison = df.groupby("is_weekend")["Global_active_power"].mean()

print("Mean consumption - weekdays vs weekends:")
print(week_comparison)

## Question 11

In [None]:

for col in ["Global_active_power", "Global_reactive_power", "Voltage", "Global_intensity"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

correlation = df[["Global_active_power", "Global_reactive_power", "Voltage", "Global_intensity"]].corr()

print("Correlation matrix:")
print(correlation)

## Question 12

In [None]:

for col in ["Sub_metering_1", "Sub_metering_2", "Sub_metering_3"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

df["Total_Sub_metering"] = df["Sub_metering_1"] + df["Sub_metering_2"] + df["Sub_metering_3"]

print(df[["Sub_metering_1", "Sub_metering_2", "Sub_metering_3", "Total_Sub_metering"]].head())

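## Question 13

In [None]:

# The sub-metering columns are in Wh per minute, while Global_active_power is in kW.
# Convert the latter to Wh per minute (kW * 1000 / 60) so both sides share a unit.
monthly_total = df.groupby(df["Date"].dt.to_period("M"))["Total_Sub_metering"].mean()
monthly_global = (df["Global_active_power"] * 1000 / 60).groupby(df["Date"].dt.to_period("M")).mean()

comparison = monthly_total > monthly_global

print("Months in which Total_Sub_metering > Global_active_power (both in Wh per minute):")
print(comparison[comparison])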
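The comparison only makes sense when both quantities share a unit. As a hedged illustration on invented one-minute readings, the sketch below converts `Global_active_power` from kW to Wh per minute and subtracts the three sub-meterings to get the energy not covered by any sub-meter. The `toy` frame and the `unmetered_wh` column are hypothetical names, not part of the dataset.

```python
import pandas as pd

# Invented one-minute readings in the dataset's units.
toy = pd.DataFrame({
    "Global_active_power": [1.2, 3.0],  # kW, averaged over one minute
    "Sub_metering_1": [0.0, 10.0],      # Wh consumed in that minute
    "Sub_metering_2": [1.0, 5.0],
    "Sub_metering_3": [17.0, 18.0],
})

# kW averaged over one minute -> Wh consumed in that minute.
total_wh = toy["Global_active_power"] * 1000 / 60
metered_wh = toy[["Sub_metering_1", "Sub_metering_2", "Sub_metering_3"]].sum(axis=1)

# Energy used by equipment that no sub-meter covers.
toy["unmetered_wh"] = total_wh - metered_wh
print(toy)
```

If the sub-metered total ever exceeded the converted global value, the unmetered column would turn negative, which is a quick sanity check on the units.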

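## Question 14

In [None]:

voltage_2008 = df[df["Date"].dt.year == 2008]

# "Date" carries no time of day, so plot the daily mean rather than every minute reading.
daily_voltage = voltage_2008.groupby("Date")["Voltage"].mean()

plt.figure(figsize=(12,5))
plt.plot(daily_voltage.index, daily_voltage.values, color="orange")
plt.title("Voltage Time Series in 2008 (daily mean)")
plt.xlabel("Date")
plt.ylabel("Voltage (V)")
plt.show()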

## Question 15

In [None]:

df["Month"] = df["Date"].dt.month

# Northern Hemisphere seasons (the measurements come from a house in France).
summer = df[df["Month"].isin([6,7,8])]
winter = df[df["Month"].isin([12,1,2])]

summer_mean = summer["Global_active_power"].mean()
winter_mean = winter["Global_active_power"].mean()

print("Mean summer consumption:", summer_mean)
print("Mean winter consumption:", winter_mean)

## Question 16

In [None]:
sample = df.sample(frac=0.01, random_state=42)

plt.figure(figsize=(12,5))

# density=True puts both histograms on the same scale; raw counts differ by a factor of about 100.
plt.hist(df["Global_active_power"].dropna(), bins=50, alpha=0.5, density=True, label="Full dataset")

plt.hist(sample["Global_active_power"].dropna(), bins=50, alpha=0.5, density=True, label="1% sample")

plt.title("Global Active Power Distribution - Full Dataset vs 1% Sample")
plt.xlabel("Global Active Power (kW)")
plt.ylabel("Density")
plt.legend()
plt.show()
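Besides overlaying histograms, the sample can be compared with the full data numerically. The sketch below is a minimal illustration on synthetic right-skewed values standing in for `Global_active_power` (not the real data). It places the `describe()` summaries side by side so the mean, quartiles and spread can be compared directly.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed stand-in for Global_active_power (not real data).
full = pd.Series(rng.lognormal(mean=0.0, sigma=0.8, size=200_000))
sample = full.sample(frac=0.01, random_state=42)

# Side-by-side summary statistics of the full series and the 1% sample.
summary = pd.DataFrame({
    "full": full.describe(),
    "sample_1pct": sample.describe(),
})
print(summary)
```

Close agreement in the quartiles suggests the sample preserves the shape of the distribution. The maximum is the statistic most likely to differ, because rare peaks may not be drawn.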

## Question 17

In [None]:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

cols = ["Global_active_power", "Global_reactive_power", "Voltage", "Global_intensity"]
df_scaled = df.copy()
df_scaled[cols] = scaler.fit_transform(df[cols])

print(df_scaled[cols].head())

## Question 18

In [None]:

from sklearn.cluster import KMeans

daily_data = df.groupby(df["Date"].dt.date)["Global_active_power"].mean().dropna().values.reshape(-1,1)

# n_init is set explicitly so the result does not depend on the scikit-learn version's default.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(daily_data)

print("Cluster centers:", kmeans.cluster_centers_)

import numpy as np
unique, counts = np.unique(labels, return_counts=True)
print("Distribution of days per cluster:", dict(zip(unique, counts)))
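The choice of 3 clusters is fixed in the exercise, and `silhouette_score` is imported at the top of the notebook but never used. One way to check the number of clusters is to compare silhouette scores over a few values of k. The sketch below does this on synthetic daily means built around three invented levels; it is not the notebook's data, and on the real series the best k may differ.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic daily means (kW) drawn around three invented consumption levels.
daily = np.concatenate([
    rng.normal(0.5, 0.05, 100),
    rng.normal(1.2, 0.05, 100),
    rng.normal(2.0, 0.05, 100),
]).reshape(-1, 1)

# Silhouette score for each candidate k (higher means better-separated clusters).
scores = {}
for k in range(2, 7):
    candidate_labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(daily)
    scores[k] = silhouette_score(daily, candidate_labels)

best_k = max(scores, key=scores.get)
print("Silhouette by k:", scores)
print("Best k:", best_k)
```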

## Question 19

In [None]:

from statsmodels.tsa.seasonal import seasonal_decompose

six_months = df[(df["Date"] >= "2007-01-01") & (df["Date"] <= "2007-06-30")]

# seasonal_decompose does not accept NaN, so gaps in the daily means are interpolated first.
series = six_months.groupby("Date")["Global_active_power"].mean().interpolate()

decomposition = seasonal_decompose(series, model="additive", period=30)
decomposition.plot()
plt.show()
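To make the additive model (observed = trend + seasonal + residual) less of a black box, the sketch below rebuilds the three components by hand on a synthetic series. It is a simplified version of the idea, not the exact statsmodels algorithm. The trend is a centered moving average, the seasonal part is the mean of the detrended values at each position in the cycle, and the residual is what remains. The weekly period of 7 and the series itself are assumptions made for the illustration.

```python
import numpy as np
import pandas as pd

days = np.arange(120)
period = 7  # assumed weekly cycle for this illustration

# Synthetic daily series: linear trend plus a weekly pattern (not real data).
series = pd.Series(1.0 + 0.01 * days + 0.2 * np.sin(2 * np.pi * days / period))

# Trend: centered moving average over one full period.
trend = series.rolling(window=period, center=True).mean()

# Seasonal: average detrended value for each position in the cycle.
detrended = series - trend
seasonal = detrended.groupby(days % period).transform("mean")

# Residual: whatever the trend and seasonal parts do not explain.
resid = series - trend - seasonal
print(resid.dropna().abs().max())
```

Because this series has no noise, the residual is essentially zero. On real consumption data the residual shows the days that the trend and the repeating pattern fail to explain.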

## Question 20

In [None]:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Keep only the rows where both the feature and the target are present.
data = df[["Global_intensity", "Global_active_power"]].dropna()
X = data[["Global_intensity"]]
y = data["Global_active_power"]

model = LinearRegression()
model.fit(X, y)

# Predictions on the training data itself, so the MSE below is an in-sample error.
y_pred = model.predict(X)

mse = mean_squared_error(y, y_pred)
print("Slope:", model.coef_[0])
print("Intercept:", model.intercept_)
print("Mean squared error (MSE):", mse)

#PART 2 - Additional exercises on the initial dataset

## Question 21

In [None]:

df["Datetime"] = pd.to_datetime(df["Date"].astype(str) + " " + df["Time"], errors="coerce")

df.set_index("Datetime", inplace=True)

hourly = df["Global_active_power"].resample("h").mean()  # lowercase "h": the uppercase alias is deprecated in recent pandas

print("Hourly mean consumption:")
print(hourly.head())

print("Top 5 hours of the day with the highest mean consumption:")
print(hourly.groupby(hourly.index.hour).mean().sort_values(ascending=False).head())

## Question 22

In [None]:

from pandas.plotting import autocorrelation_plot

autocorrelation_plot(hourly.dropna())
plt.show()

lag_1h = hourly.autocorr(lag=1)
lag_24h = hourly.autocorr(lag=24)
lag_48h = hourly.autocorr(lag=48)

print("Autocorrelation 1h:", lag_1h)
print("Autocorrelation 24h:", lag_24h)
print("Autocorrelation 48h:", lag_48h)
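To help read these lags, the sketch below applies the same `Series.autocorr` call to a synthetic hourly signal with a built-in 24-hour cycle plus noise. The signal and the name `hourly_demo` are invented for the illustration. A daily pattern shows up as high autocorrelation at lags 24 and 48 and strongly negative autocorrelation at lag 12, where the cycle is in opposite phase.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic hourly signal: 60 days of a 24-hour sine cycle plus noise (not real data).
hours = np.arange(24 * 60)
signal = np.sin(2 * np.pi * hours / 24) + rng.normal(0, 0.3, hours.size)
hourly_demo = pd.Series(signal)

# Autocorrelation at a few lags, in hours.
demo_lags = {lag: hourly_demo.autocorr(lag=lag) for lag in (1, 12, 24, 48)}
for lag, value in demo_lags.items():
    print(f"lag {lag:>2} h: {value:.3f}")
```

If the real hourly series shows a similar bump at 24 and 48 hours, that is evidence of a daily consumption routine.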

## Question 23

In [None]:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

features = ["Global_active_power", "Global_reactive_power", "Voltage", "Global_intensity"]
X = df[features].dropna()

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print("Variance explained by each component:", pca.explained_variance_ratio_)
print("Total variance explained:", pca.explained_variance_ratio_.sum())

## Question 24

In [None]:

kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X_pca)

plt.figure(figsize=(8,6))
plt.scatter(X_pca[:,0], X_pca[:,1], c=labels, cmap="viridis", alpha=0.5)
plt.title("Clusters in PCA space (2 components)")
plt.xlabel("PCA 1")
plt.ylabel("PCA 2")
plt.colorbar(label="Cluster")
plt.show()

## Question 25

In [None]:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

X = df[["Voltage"]].dropna()
y = df["Global_active_power"].dropna()

# Keep only the rows where both Voltage and Global_active_power are present.
valid = X.index.intersection(y.index)
X = X.loc[valid]
y = y.loc[valid]

lin_reg = LinearRegression()
lin_reg.fit(X, y)
y_pred_lin = lin_reg.predict(X)

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

poly_reg = LinearRegression()
poly_reg.fit(X_poly, y)
y_pred_poly = poly_reg.predict(X_poly)

# RMSE as the square root of the MSE; the squared=False argument was removed in recent scikit-learn.
rmse_lin = np.sqrt(mean_squared_error(y, y_pred_lin))
rmse_poly = np.sqrt(mean_squared_error(y, y_pred_poly))

print("Linear RMSE:", rmse_lin)
print("Polynomial RMSE:", rmse_poly)

plt.figure(figsize=(8,5))
plt.scatter(X, y, s=10, alpha=0.3, label="Actual data")
plt.plot(X, y_pred_lin, color="red", label="Linear Regression")
plt.scatter(X, y_pred_poly, color="green", s=1, alpha=0.3, label="Polynomial Regression (degree 2)")
plt.xlabel("Voltage (V)")
plt.ylabel("Global Active Power (kW)")
plt.legend()
plt.show()

#PART 3 - New dataset: Appliances Energy Prediction

## Question 26

In [None]:

df_app = pd.read_csv("energydata_complete.csv")


# info() prints its report itself and returns None, so it is not wrapped in print().
df_app.info()
print(df_app.describe())

## Question 27

In [None]:

plt.figure(figsize=(8,5))
plt.hist(df_app["Appliances"], bins=50, color="skyblue", edgecolor="black")
plt.title("Distribution of Appliances consumption")
plt.xlabel("Consumption (Wh)")
plt.ylabel("Frequency")
plt.show()

plt.figure(figsize=(12,5))
plt.plot(df_app["Appliances"][:500], color="orange")
plt.title("Appliances consumption (example)")
plt.xlabel("Time (records)")
plt.ylabel("Consumption (Wh)")
plt.show()

## Question 28

In [None]:

corr = df_app.corr(numeric_only=True)["Appliances"].sort_values(ascending=False)

print("Correlation of Appliances with the other variables:")
print(corr)

## Question 29

In [None]:

num_cols = df_app.select_dtypes(include=["float64","int64"]).columns

scaler = MinMaxScaler()
df_app_scaled = df_app.copy()
df_app_scaled[num_cols] = scaler.fit_transform(df_app[num_cols])

print(df_app_scaled[num_cols].head())

## Question 30

In [None]:

X = df_app_scaled[num_cols].drop(columns=["Appliances"])  # remove the target

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

print("Explained variance:", pca.explained_variance_ratio_)

plt.figure(figsize=(8,6))
# No cmap here: the points are not colored by any variable.
plt.scatter(X_pca[:,0], X_pca[:,1], alpha=0.3, s=10)
plt.title("PCA - 2 Components")
plt.xlabel("Component 1")
plt.ylabel("Component 2")
plt.show()

## Question 31

In [None]:

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Use only the numeric columns: the non-numeric date column cannot be fed to the model.
X = df_app_scaled[num_cols].drop(columns=["Appliances"])
y = df_app_scaled["Appliances"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
y_pred = lin_reg.predict(X_test)

print("R²:", r2_score(y_test, y_pred))
# The target was Min-Max scaled in Question 29, so this RMSE is in scaled units, not Wh.
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
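Because the target `Appliances` was Min-Max scaled along with the features, the RMSE printed above is on a 0 to 1 scale. A minimal sketch of converting it back to Wh follows. It assumes a plain Min-Max scaling of the target, and the target values and predictions are invented. Since Min-Max scaling is a linear map, multiplying the scaled RMSE by (max - min) of the original target gives the RMSE in the original unit.

```python
import numpy as np

# Invented target values in Wh and their Min-Max scaling to [0, 1].
y_wh = np.array([40.0, 60.0, 100.0, 540.0])
y_min, y_max = y_wh.min(), y_wh.max()
y_scaled = (y_wh - y_min) / (y_max - y_min)

# Hypothetical model predictions on the scaled target.
y_pred_scaled = y_scaled + np.array([0.02, -0.01, 0.03, -0.04])

rmse_scaled = np.sqrt(np.mean((y_scaled - y_pred_scaled) ** 2))
rmse_wh = rmse_scaled * (y_max - y_min)  # undo the scaling factor

print("RMSE (scaled units):", rmse_scaled)
print("RMSE (Wh):", rmse_wh)
```

The offset `y_min` cancels in the difference between truth and prediction, so only the range matters for the conversion.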

## Question 32

In [None]:

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

y_pred_rf = rf.predict(X_test)

# RMSE as the square root of the MSE; the squared=False argument was removed in recent scikit-learn.
print("Random Forest - RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_rf)))
print("Random Forest - R²:", r2_score(y_test, y_pred_rf))

## Question 33

In [None]:

# Use only the numeric columns: K-Means cannot handle the non-numeric date column.
X_cluster = df_app_scaled[num_cols].drop(columns=["Appliances"])

kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X_cluster)

print("Cluster centers (3 groups):")
print(kmeans.cluster_centers_)

unique, counts = np.unique(labels, return_counts=True)
print("Distribution:", dict(zip(unique, counts)))

## Question 34

In [None]:

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Binary target: 1 when consumption is above the median, 0 otherwise.
median_val = df_app["Appliances"].median()
df_app["High_Consumption"] = (df_app["Appliances"] > median_val).astype(int)

# Use only the numeric columns: the non-numeric date column cannot be fed to the classifiers.
X = df_app_scaled[num_cols].drop(columns=["Appliances"])
y = df_app["High_Consumption"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)
y_pred_log = log_reg.predict(X_test)

rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)
y_pred_rf = rf_clf.predict(X_test)

print("Logistic Regression - Accuracy:", log_reg.score(X_test, y_test))
print("Random Forest Classifier - Accuracy:", rf_clf.score(X_test, y_test))

## Question 35

In [None]:

from sklearn.metrics import confusion_matrix, classification_report

print("Confusion matrix - Logistic Regression")
print(confusion_matrix(y_test, y_pred_log))
print(classification_report(y_test, y_pred_log))

print("Confusion matrix - Random Forest Classifier")
print(confusion_matrix(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))

#PART 4 - Exercises in Orange Data Mining (we comment on each action step by step)

## Question 36

In [None]:

# I imported the dataset with the CSV File Import widget in Orange.
# To view its contents, I connected the widget to a Data Table.
# The table showed variables such as Date, Time and Global_active_power, among others.
# I noticed that the dataset is very large, with about 2 million records, because it covers several years of data.
# The numeric variables were identified without problems, but the date and time variables may need adjustments.


## Question 37

In [None]:
# I used the Sample Data widget to create a sample with 1% of the records.
# Although small, this sample proved representative because it still contains thousands of rows.
# Comparing the Global_active_power distribution of the sample with that of the full dataset in the Distribution widget, I noticed that the histograms had very similar shapes.
# This confirms that the sampling preserved the characteristics of the original distribution, with most values concentrated at low consumption and a few peaks.

## Question 38

In [None]:
# For this analysis, I used the Distribution widget to create a histogram of Global_active_power.
# The plot showed that most consumption values fall in the low range, below 2 kW.
# There is a long tail of values, with some higher consumption peaks that occur less frequently.
# This suggests that household consumption is usually moderate, with occasional peaks when appliances are used more intensively.

## Question 39

In [None]:

# In the Scatter Plot widget, I set Voltage on the X axis and Global_intensity on the Y axis.
# The resulting plot showed a clear positive correlation, where an increase in current intensity generally corresponds to an increase in voltage.
# Although the points do not form a perfectly straight line, the cloud of points suggests a direct, visible relationship between the two variables.
# This relationship makes sense from a physics standpoint: a higher current intensity is directly linked to higher energy consumption.
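The direction read off a scatter plot can be cross-checked numerically with a Pearson correlation coefficient. The sketch below shows the mechanics on a handful of invented readings that say nothing about the real dataset. Replacing the `toy` frame with the actual `Voltage` and `Global_intensity` columns would confirm or correct the visual impression.

```python
import pandas as pd

# Invented readings; replace with the real columns to check the actual dataset.
toy = pd.DataFrame({
    "Voltage": [241.0, 239.5, 238.0, 236.2, 235.0],
    "Global_intensity": [2.0, 6.0, 10.0, 15.0, 19.0],
})

# Series.corr uses the Pearson coefficient by default.
r = toy["Voltage"].corr(toy["Global_intensity"])
direction = "positive" if r > 0 else "negative" if r < 0 else "none"
print(f"r = {r:.3f} ({direction})")
```

The sign of `r` gives the direction of the relationship and its magnitude gives the strength, which is easier to compare across variables than a visual judgment.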

## Question 40

In [None]:

# For this step, I used the k-Means widget, configured to create 3 clusters.
# I used the variables Sub_metering_1, Sub_metering_2 and Sub_metering_3 as the clustering attributes.
# The results were visualized in a Scatter Plot, with the points colored according to the cluster they belong to.
# The analysis showed that each group represents a distinct consumption pattern: one cluster with high consumption in area 1, another in area 2, and a third with more evenly distributed consumption.
# This shows that K-Means was effective at identifying different energy profiles.
