# 📘 CP1 IoT - Energy Consumption Analysis

This notebook contains the organized solutions to the CP1 IoT questions, using the following datasets:
- **Individual Household Electric Power Consumption**
- **Appliances Energy Prediction**

The analyses include:
- Data cleaning
- Visualizations
- Statistics
- Machine learning models
- Exercises with Orange Data Mining (described in comments)


In [None]:
# CP1 IoT

# PART 1 - Initial exercises with Individual Household Electric Power Consumption

## Question 1

In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

import warnings
warnings.filterwarnings('ignore')
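
The cells below rely on a DataFrame named `df` that this notebook never creates. A minimal sketch of the loading step, assuming the UCI distribution of the dataset (semicolon-separated, with `?` marking missing readings). The helper name `load_household` and the file name `household_power_consumption.txt` are assumptions, not taken from the original:

```python
import io
import pandas as pd

def load_household(source):
    """Read the household power file; `source` is a path or a file-like object."""
    # Assumption: ';' is the separator and '?' marks a missing reading.
    return pd.read_csv(source, sep=";", na_values=["?"], low_memory=False)

# Tiny in-memory sample with the same layout, so the sketch runs without the real file.
sample_text = (
    "Date;Time;Global_active_power;Global_reactive_power;Voltage;"
    "Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3\n"
    "16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000\n"
    "16/12/2006;17:25:00;?;?;?;?;?;?;\n"
)
df_sample = load_household(io.StringIO(sample_text))
print(df_sample.dtypes)

# With the real file (assumed name):
# df = load_household("household_power_consumption.txt")
```

Declaring `?` as a missing-value marker at load time means the numeric columns arrive as floats, so the later `pd.to_numeric(..., errors="coerce")` calls become harmless no-ops.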

## Question 2

In [None]:

# Global_active_power is the active power consumed (the energy actually used by the appliances).
# Global_reactive_power is the reactive power, associated with magnetic fields
# (as in motors and transformers). It does no useful work, but it still circulates in the network.
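
The relationship between the two quantities can be made concrete with the standard power triangle: apparent power is S = sqrt(P² + Q²) and the power factor is P / S. A small sketch with made-up readings (the values and variable names are illustrative, not taken from the dataset):

```python
import numpy as np

# Illustrative readings in kW (active) and kVAr (reactive); not real dataset rows.
active_kw = np.array([3.0, 1.2, 0.6])
reactive_kvar = np.array([4.0, 0.5, 0.0])

apparent_kva = np.sqrt(active_kw**2 + reactive_kvar**2)  # S = sqrt(P^2 + Q^2)
power_factor = active_kw / apparent_kva                  # share of S that does useful work

print(apparent_kva)
print(power_factor)
```

A power factor close to 1 means almost all of the apparent power is doing useful work; lower values indicate a larger reactive share.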

## Question 3

In [None]:

missing = df.isnull().sum()
print("Missing values per column:")
print(missing)
print("Total missing values:", missing.sum())
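
If the file was read without declaring a missing-value marker, `isnull()` only sees empty fields. The UCI file marks missing readings with `?`, so one hedged way to count those as well is to coerce the columns to numeric first. The sample frame below is made up for illustration:

```python
import pandas as pd

# Made-up rows mimicking columns that were read as text, with '?' as a placeholder.
raw = pd.DataFrame({
    "Global_active_power": ["4.216", "?", "5.360"],
    "Voltage": ["234.84", "?", "233.63"],
    "Sub_metering_3": [17.0, None, 16.0],
})

naive_missing = raw.isnull().sum()                    # only sees the real NaN
coerced = raw.apply(pd.to_numeric, errors="coerce")   # '?' becomes NaN
true_missing = coerced.isnull().sum()

print(naive_missing)
print(true_missing)
```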

## Question 4

In [None]:

df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")
df["Weekday"] = df["Date"].dt.day_name()

print(df[["Date", "Weekday"]].head())

## Question 5

In [None]:

# .copy() avoids modifying a view of df in the assignment below
df_2007 = df[df["Date"].dt.year == 2007].copy()

df_2007["Global_active_power"] = pd.to_numeric(df_2007["Global_active_power"], errors="coerce")

daily_mean = df_2007.groupby(df_2007["Date"].dt.date)["Global_active_power"].mean()

print("Average daily consumption in 2007:")
print(daily_mean.head())

## Question 6

In [None]:
import matplotlib.pyplot as plt

one_day = df[df["Date"] == "2007-01-10"].copy()

one_day["Global_active_power"] = pd.to_numeric(one_day["Global_active_power"], errors="coerce")

plt.figure(figsize=(12,5))
plt.plot(one_day["Global_active_power"])
plt.title("Global Active Power variation on 2007-01-10")
plt.xlabel("Records throughout the day")
plt.ylabel("Global Active Power (kW)")
plt.show()

## Question 7

In [None]:

df["Voltage"] = pd.to_numeric(df["Voltage"], errors="coerce")

plt.figure(figsize=(8,5))
plt.hist(df["Voltage"].dropna(), bins=50, color="skyblue", edgecolor="black")
plt.title("Distribution of the Voltage variable")
plt.xlabel("Voltage (V)")
plt.ylabel("Frequency")
plt.show()

## Question 8

In [None]:

df["Global_active_power"] = pd.to_numeric(df["Global_active_power"], errors="coerce")

monthly_mean = df.groupby([df["Date"].dt.to_period("M")])["Global_active_power"].mean()

print("Average consumption per month:")
print(monthly_mean)

## Question 9

In [None]:

daily_sum = df.groupby(df["Date"].dt.date)["Global_active_power"].sum()

max_day = daily_sum.idxmax()
max_value = daily_sum.max()

# The readings are minute-averaged kW, so their daily sum divided by 60 gives kWh
print(f"Day of highest consumption: {max_day} with {max_value / 60:.2f} kWh")

## Question 10

In [None]:

df["is_weekend"] = df["Weekday"].isin(["Saturday", "Sunday"])

week_comparison = df.groupby("is_weekend")["Global_active_power"].mean()

print("Average consumption - weekdays vs weekends:")
print(week_comparison)

## Question 11

In [None]:

for col in ["Global_active_power", "Global_reactive_power", "Voltage", "Global_intensity"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

correlation = df[["Global_active_power", "Global_reactive_power", "Voltage", "Global_intensity"]].corr()

print("Correlation matrix:")
print(correlation)

## Question 12

In [None]:

for col in ["Sub_metering_1", "Sub_metering_2", "Sub_metering_3"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

df["Total_Sub_metering"] = df["Sub_metering_1"] + df["Sub_metering_2"] + df["Sub_metering_3"]

print(df[["Sub_metering_1", "Sub_metering_2", "Sub_metering_3", "Total_Sub_metering"]].head())

## Question 13

In [None]:

# Note: the Sub_metering_* columns are in Wh per minute, while Global_active_power is in kW,
# so this compares raw values expressed in different units.
monthly_total = df.groupby(df["Date"].dt.to_period("M"))["Total_Sub_metering"].mean()
monthly_global = df.groupby(df["Date"].dt.to_period("M"))["Global_active_power"].mean()

comparison = monthly_total > monthly_global

print("Months in which Total_Sub_metering > Global_active_power:")
print(comparison[comparison])
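
The comparison above mixes units: per the dataset description, the sub-metering columns are in watt-hours per minute, while `Global_active_power` is a minute-averaged power in kilowatts. A sketch of a like-for-like comparison, converting the global reading to Wh per minute with `kW * 1000 / 60`; the sample values are invented:

```python
import pandas as pd

# Invented one-minute readings.
demo = pd.DataFrame({
    "Global_active_power": [4.2, 1.5, 0.3],   # kW, minute-averaged
    "Sub_metering_1": [0.0, 1.0, 0.0],        # Wh consumed in that minute
    "Sub_metering_2": [1.0, 2.0, 0.0],
    "Sub_metering_3": [17.0, 18.0, 1.0],
})

demo["Global_Wh"] = demo["Global_active_power"] * 1000 / 60
demo["Total_Sub_metering"] = demo[["Sub_metering_1", "Sub_metering_2", "Sub_metering_3"]].sum(axis=1)
# Energy that none of the three sub-meters captured.
demo["Unmetered_Wh"] = demo["Global_Wh"] - demo["Total_Sub_metering"]

print(demo[["Global_Wh", "Total_Sub_metering", "Unmetered_Wh"]])
```

Once both sides are in Wh per minute, the monthly means can be compared directly.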

## Question 14

In [None]:

voltage_2008 = df[df["Date"].dt.year == 2008]

plt.figure(figsize=(12,5))
plt.plot(voltage_2008["Date"], voltage_2008["Voltage"], color="orange")
plt.title("Voltage time series in 2008")
plt.xlabel("Date")
plt.ylabel("Voltage (V)")
plt.show()

## Question 15

In [None]:

df["Month"] = df["Date"].dt.month

# Northern-hemisphere seasons (the data was recorded in France)
summer = df[df["Month"].isin([6,7,8])]
winter = df[df["Month"].isin([12,1,2])]

summer_mean = summer["Global_active_power"].mean()
winter_mean = winter["Global_active_power"].mean()

print("Average summer consumption:", summer_mean)
print("Average winter consumption:", winter_mean)

## Question 16

In [None]:
sample = df.sample(frac=0.01, random_state=42)

plt.figure(figsize=(12,5))

# density=True normalizes both histograms so the 1% sample is comparable to the full dataset
plt.hist(df["Global_active_power"].dropna(), bins=50, alpha=0.5, density=True, label="Full dataset")

plt.hist(sample["Global_active_power"].dropna(), bins=50, alpha=0.5, density=True, label="1% sample")

plt.title("Global Active Power distribution - full dataset vs 1% sample")
plt.xlabel("Global Active Power (kW)")
plt.ylabel("Density")
plt.legend()
plt.show()

## Question 17

In [None]:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

cols = ["Global_active_power", "Global_reactive_power", "Voltage", "Global_intensity"]
df_scaled = df.copy()
df_scaled[cols] = scaler.fit_transform(df[cols])

print(df_scaled[cols].head())

## Question 18

In [None]:

from sklearn.cluster import KMeans

daily_data = df.groupby(df["Date"].dt.date)["Global_active_power"].mean().dropna().values.reshape(-1,1)

kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(daily_data)

print("Cluster centers:", kmeans.cluster_centers_)

import numpy as np
unique, counts = np.unique(labels, return_counts=True)
print("Distribution of days per cluster:", dict(zip(unique, counts)))
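
`silhouette_score` is imported at the top of the notebook but never used. One possible use, sketched here on synthetic daily means (the data and the range of k are assumptions for illustration), is to compare candidate cluster counts instead of fixing k = 3:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
# Synthetic "daily mean" values drawn around three well-separated levels.
daily_demo = np.concatenate([
    rng.normal(0.5, 0.05, 60),
    rng.normal(1.5, 0.05, 60),
    rng.normal(3.0, 0.05, 60),
]).reshape(-1, 1)

scores = {}
for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    scores[k] = silhouette_score(daily_demo, km.fit_predict(daily_demo))

# Higher silhouette means tighter, better-separated clusters.
best_k = max(scores, key=scores.get)
print(scores)
print("Best k by silhouette:", best_k)
```

On the real `daily_data` array the same loop would show whether three clusters is actually a reasonable choice.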

## Question 19

In [None]:

from statsmodels.tsa.seasonal import seasonal_decompose

six_months = df[(df["Date"] >= "2007-01-01") & (df["Date"] <= "2007-06-30")]

series = six_months.groupby("Date")["Global_active_power"].mean()

decomposition = seasonal_decompose(series, model="additive", period=30)
decomposition.plot()
plt.show()

## Question 20

In [None]:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X = df[["Global_intensity"]].dropna()
y = df["Global_active_power"].dropna()

valid = X.index.intersection(y.index)
X = X.loc[valid]
y = y.loc[valid]

model = LinearRegression()
model.fit(X, y)

y_pred = model.predict(X)

mse = mean_squared_error(y, y_pred)
print("Slope:", model.coef_[0])
print("Intercept:", model.intercept_)
print("Mean squared error (MSE):", mse)

# PART 2 - Additional exercises on the initial dataset

## Question 21

In [None]:

df["Datetime"] = pd.to_datetime(df["Date"].astype(str) + " " + df["Time"], errors="coerce")

df.set_index("Datetime", inplace=True)

hourly = df["Global_active_power"].resample("H").mean()

print("Average hourly consumption:")
print(hourly.head())

print("Top 5 hours of highest consumption:")
print(hourly.groupby(hourly.index.hour).mean().sort_values(ascending=False).head())

## Question 22

In [None]:

from pandas.plotting import autocorrelation_plot

autocorrelation_plot(hourly.dropna())
plt.show()

lag_1h = hourly.autocorr(lag=1)
lag_24h = hourly.autocorr(lag=24)
lag_48h = hourly.autocorr(lag=48)

print("Autocorrelation 1h:", lag_1h)
print("Autocorrelation 24h:", lag_24h)
print("Autocorrelation 48h:", lag_48h)
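
`Series.autocorr(lag=k)` is the Pearson correlation between the series and a copy of itself shifted by k steps. A sketch on a synthetic hourly signal with a 24-hour cycle (the signal is invented for illustration) shows why a strong daily pattern produces a high value at lag 24:

```python
import numpy as np
import pandas as pd

hours = np.arange(24 * 14)                      # two weeks of hourly steps
rng = np.random.default_rng(0)
# Invented signal: a 24-hour sine cycle plus a little noise.
signal = pd.Series(np.sin(2 * np.pi * hours / 24) + rng.normal(0, 0.1, hours.size))

lag = 24
manual = signal.corr(signal.shift(lag))         # correlation with the shifted copy
builtin = signal.autocorr(lag=lag)

print("lag 24:", builtin)
print("lag 12:", signal.autocorr(lag=12))       # half a cycle apart, so anti-correlated
```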

## Question 23

In [None]:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

features = ["Global_active_power", "Global_reactive_power", "Voltage", "Global_intensity"]
X = df[features].dropna()

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print("Variance explained by each component:", pca.explained_variance_ratio_)
print("Total variance explained:", pca.explained_variance_ratio_.sum())
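
The explained-variance ratios can also guide how many components to keep. A sketch, on synthetic data, of picking the smallest number of components whose cumulative explained variance reaches a chosen threshold (the 90% threshold and the data are assumptions for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(500, 1))
# Four synthetic features: three driven by one latent factor, one independent.
X_demo = np.hstack([
    base + rng.normal(0, 0.1, (500, 1)),
    2 * base + rng.normal(0, 0.1, (500, 1)),
    -base + rng.normal(0, 0.1, (500, 1)),
    rng.normal(size=(500, 1)),
])

pca_full = PCA().fit(StandardScaler().fit_transform(X_demo))
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
# Index of the first cumulative value that reaches the threshold, plus one.
n_keep = int(np.searchsorted(cumulative, 0.90) + 1)

print(cumulative)
print("Components needed for 90%:", n_keep)
```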

## Question 24

In [None]:

kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X_pca)

plt.figure(figsize=(8,6))
plt.scatter(X_pca[:,0], X_pca[:,1], c=labels, cmap="viridis", alpha=0.5)
plt.title("Clusters in PCA space (2 components)")
plt.xlabel("PCA 1")
plt.ylabel("PCA 2")
plt.colorbar(label="Cluster")
plt.show()

## Question 25

In [None]:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

X = df[["Voltage"]].dropna()
y = df["Global_active_power"].dropna()

valid = X.index.intersection(y.index)
X = X.loc[valid]
y = y.loc[valid]

lin_reg = LinearRegression()
lin_reg.fit(X, y)
y_pred_lin = lin_reg.predict(X)

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

poly_reg = LinearRegression()
poly_reg.fit(X_poly, y)
y_pred_poly = poly_reg.predict(X_poly)

# np.sqrt is used because the `squared` argument is deprecated in newer scikit-learn versions
rmse_lin = np.sqrt(mean_squared_error(y, y_pred_lin))
rmse_poly = np.sqrt(mean_squared_error(y, y_pred_poly))

print("RMSE Linear:", rmse_lin)
print("RMSE Polynomial:", rmse_poly)

plt.figure(figsize=(8,5))
plt.scatter(X, y, s=10, alpha=0.3, label="Actual data")
plt.plot(X, y_pred_lin, color="red", label="Linear regression")
plt.scatter(X, y_pred_poly, color="green", s=1, alpha=0.3, label="Polynomial regression (degree 2)")
plt.xlabel("Voltage (V)")
plt.ylabel("Global Active Power (kW)")
plt.legend()
plt.show()

# PART 3 - New dataset: Appliances Energy Prediction

## Question 26

In [None]:

df_app = pd.read_csv("energydata_complete.csv")


print(df_app.info())
print(df_app.describe())

## Question 27

In [None]:

plt.figure(figsize=(8,5))
plt.hist(df_app["Appliances"], bins=50, color="skyblue", edgecolor="black")
plt.title("Distribution of Appliances consumption")
plt.xlabel("Consumption (Wh)")
plt.ylabel("Frequency")
plt.show()

plt.figure(figsize=(12,5))
plt.plot(df_app["Appliances"][:500], color="orange")
plt.title("Appliances consumption (first 500 records)")
plt.xlabel("Time (records)")
plt.ylabel("Consumption (Wh)")
plt.show()

## Question 28

In [None]:

corr = df_app.corr(numeric_only=True)["Appliances"].sort_values(ascending=False)

print("Correlation of Appliances with the other variables:")
print(corr)

## Question 29

In [None]:

num_cols = df_app.select_dtypes(include=["float64","int64"]).columns

scaler = MinMaxScaler()
df_app_scaled = df_app.copy()
df_app_scaled[num_cols] = scaler.fit_transform(df_app[num_cols])

print(df_app_scaled[num_cols].head())

## Question 30

In [None]:

X = df_app_scaled[num_cols].drop(columns=["Appliances"])  # drop the target

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

print("Explained variance:", pca.explained_variance_ratio_)

plt.figure(figsize=(8,6))
# cmap is omitted because no color values (c=...) are passed to scatter
plt.scatter(X_pca[:,0], X_pca[:,1], alpha=0.3, s=10)
plt.title("PCA - 2 Components")
plt.xlabel("Component 1")
plt.ylabel("Component 2")
plt.show()

## Question 31

In [None]:

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Use only the numeric columns: the text `date` column cannot be fed to the model
X = df_app_scaled[num_cols].drop(columns=["Appliances"])
y = df_app_scaled["Appliances"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
y_pred = lin_reg.predict(X_test)

print("R²:", r2_score(y_test, y_pred))
# np.sqrt is used because the `squared` argument is deprecated in newer scikit-learn versions
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
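
Fitting `MinMaxScaler` on the whole table before splitting lets test-set statistics leak into training. A common alternative, sketched here on synthetic data (the feature matrix and coefficients are invented), is to put the scaler and the model in a pipeline so the scaler is fitted on the training fold only:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(400, 3))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(0, 0.1, 400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

model = make_pipeline(MinMaxScaler(), LinearRegression())
model.fit(X_tr, y_tr)            # scaler statistics come from X_tr only
pred = model.predict(X_te)       # X_te is transformed with the training min/max

rmse = np.sqrt(mean_squared_error(y_te, pred))
print("R2:", r2_score(y_te, pred))
print("RMSE:", rmse)
```

The same pattern would apply to the Random Forest and the classifiers in the following questions, although tree-based models are largely insensitive to feature scaling.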

## Question 32

In [None]:

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

y_pred_rf = rf.predict(X_test)

# np.sqrt is used because the `squared` argument is deprecated in newer scikit-learn versions
print("Random Forest - RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_rf)))
print("Random Forest - R²:", r2_score(y_test, y_pred_rf))

## Question 33

In [None]:

# Use only the numeric columns: the text `date` column cannot be clustered
X_cluster = df_app_scaled[num_cols].drop(columns=["Appliances"])

kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X_cluster)

print("Cluster centers (3 groups):")
print(kmeans.cluster_centers_)

unique, counts = np.unique(labels, return_counts=True)
print("Distribution:", dict(zip(unique, counts)))

## Question 34

In [None]:

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

median_val = df_app["Appliances"].median()
df_app["High_Consumption"] = (df_app["Appliances"] > median_val).astype(int)

X = df_app_scaled[num_cols].drop(columns=["Appliances"])  # numeric columns only; the text `date` column is excluded
y = df_app["High_Consumption"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)
y_pred_log = log_reg.predict(X_test)

rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)
y_pred_rf = rf_clf.predict(X_test)

print("Logistic Regression - Accuracy:", log_reg.score(X_test, y_test))
print("Random Forest Classifier - Accuracy:", rf_clf.score(X_test, y_test))

## Question 35

In [None]:

from sklearn.metrics import confusion_matrix, classification_report

print("Confusion matrix - Logistic Regression")
print(confusion_matrix(y_test, y_pred_log))
print(classification_report(y_test, y_pred_log))

print("Confusion matrix - Random Forest Classifier")
print(confusion_matrix(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))
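
The figures in `classification_report` follow directly from the confusion matrix. A sketch with invented labels (not results from this notebook) showing how precision and recall for the positive class are derived:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Invented labels for illustration only.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found

print(tn, fp, fn, tp)
print(precision, recall)
```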

# PART 4 - Exercises in Orange Data Mining (the steps for each action are described in comments)

## Question 36

In [None]:
# In Orange, I started by loading the dataset with the CSV File Import widget, then connected it to Data Table to inspect the first rows.
# The table shows several variables such as Date, Time, Global_active_power, Voltage, Global_intensity, Sub_metering_1, among others.
# The dataset has about two million records, because it contains minute-by-minute measurements over several years.
# The numeric variables are already recognized correctly, while the date/time columns may need extra handling depending on the analysis.

## Question 37

In [None]:
# Next I used the Data Sampler widget, configured to keep only 1% of the dataset.
# Although small, this sample still has roughly 20,000 records, so it remains representative.
# Connecting it to Distributions (variable Global_active_power), I saw that the shape of the sample's histogram is very similar to that of the full dataset.
# In other words, even with 1% of the data the distribution barely changes: most values sit around low consumption, with a few peaks.
# This shows that sampling preserves the characteristics of the dataset.

## Question 38

In [None]:

# I used the Distributions widget specifically to visualize Global_active_power.
# The histogram confirms that consumption is concentrated at low values (most of it below 2 kW).
# There are records with much higher values, but far fewer of them: these are the consumption "peaks", probably at times of heavier appliance use.
# This indicates that, in general, the household's consumption is moderate, but there are moments of intense use that stretch the distribution to the right.

## Question 39

In [None]:

# In Scatter Plot, I set Voltage on the X axis and Global_intensity on the Y axis.
# The plot shows a clear trend of positive correlation: as the intensity (electric current) increases, there is an associated variation in the voltage.
# The points do not fall on a perfect straight line, but the data cloud suggests a direct relationship between the two variables.
# This makes sense physically: a higher current is tied to higher power consumption, and that affects the electrical network.
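
A visual impression from a scatter plot can be checked numerically. A sketch, on invented voltage and current values, of computing the Pearson coefficient so that the sign and strength of the relationship are read from a number rather than from the point cloud. The data-generating assumption below (voltage sagging slightly under load) is illustrative only; the actual sign should be confirmed on the real columns:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Invented data: voltage sags slightly as current rises (an assumption for this demo).
current = rng.uniform(0.5, 20.0, 1000)
voltage = 242.0 - 0.15 * current + rng.normal(0, 1.0, 1000)

pair = pd.DataFrame({"Voltage": voltage, "Global_intensity": current})
r = pair["Voltage"].corr(pair["Global_intensity"])

print("Pearson r:", r)
print("Direction:", "positive" if r > 0 else "negative")
```

Running the same two lines on `df[["Voltage", "Global_intensity"]]` would confirm or correct the reading of the Orange scatter plot.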

## Question 40

In [None]:

# I applied the k-Means widget configured with 3 clusters, using the variables Sub_metering_1, Sub_metering_2 and Sub_metering_3. After connecting it to Scatter Plot, the points were colored by the clusters found.
# The interpretation is that each cluster represents a distinct pattern of household consumption. For example:
# One group may have higher consumption in Sub_metering_1 (kitchen).
# Another group may be higher in Sub_metering_2 (laundry room).
# The third may be more balanced across the three sub-meterings.
# In other words, k-Means separated different consumption profiles, which may correspond to distinct usage patterns within the household.