
# Assignment – Random Forest (Classification & Regression)
## Analysis of political polling in Canada

**Requirements:**  
- Train the models ONLY on data before **2025-01-13**  
- Test on the data after that date  
- Compare the predicted results with the actual ones


In [7]:
!pip install -q gdown

!gdown "https://drive.google.com/uc?id=1xBhcsoJsQT3Lgk65UzgcprBlLmfxyyQz"
# !gdown --folder "https://drive.google.com/drive/folders/1RJCYu6hMsa3xOHn1-jZWsq4o0N10Pvt8"


Downloading...
From: https://drive.google.com/uc?id=1xBhcsoJsQT3Lgk65UzgcprBlLmfxyyQz
To: /content/canada_polling_data.csv
100% 17.5k/17.5k [00:00<00:00, 44.2MB/s]


In [8]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, classification_report, mean_squared_error, r2_score
import matplotlib.pyplot as plt
import warnings
from sklearn.model_selection import TimeSeriesSplit

warnings.filterwarnings('ignore')

## 1. Loading and cleaning the data

In [9]:

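# Load the polling data downloaded above
data = pd.read_csv('canada_polling_data.csv')

# Drop the 'Others' column and parse the poll end dates;
# dates that do not match the format become NaT
data = data.drop(columns=['Others'])
data['Last date of polling'] = pd.to_datetime(
    data['Last date of polling'], format='%B %d, %Y', errors='coerce'
)

# Convert party shares to numbers; non-numeric entries become NaN
party_columns = ['CPC', 'LPC', 'NDP', 'BQ', 'PPC', 'GPC']
for col in party_columns:
    data[col] = pd.to_numeric(data[col], errors='coerce')

# Order chronologically and index by poll end date
data = data.sort_values('Last date of polling')
data.set_index('Last date of polling', inplace=True)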

data.head()


| Last date of polling | CPC | LPC | NDP | BQ | PPC | GPC |
|---|---|---|---|---|---|---|
| 2021-09-20 | 33.7 | 32.6 | 17.8 | 7.6 | 4.9 | 2.3 |
| 2021-10-20 | 30.0 | 33.0 | 19.0 | 7.0 | 6.0 | 3.0 |
| 2021-10-22 | 30.8 | 29.9 | 23.1 | 7.3 | 5.3 | 2.7 |
| 2021-10-24 | 33.3 | 33.8 | 15.4 | 5.5 | NaN | 3.0 |
| 2021-10-29 | 30.1 | 30.8 | 21.6 | 7.4 | 5.8 | 3.1 |
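The cleaning cell relies on `errors='coerce'` to turn unparseable dates into `NaT` and non-numeric poll values into `NaN` instead of raising an error. A minimal sketch of that behaviour on a made-up three-row frame (the sample values are illustrative and not taken from the dataset):

```python
import pandas as pd

# Hypothetical sample rows; the real file is canada_polling_data.csv.
sample = pd.DataFrame({
    "Last date of polling": ["September 20, 2021", "not a date", "October 24, 2021"],
    "PPC": ["4.9", "6", "-"],
})

# Strings that do not match the format become NaT instead of raising.
sample["Last date of polling"] = pd.to_datetime(
    sample["Last date of polling"], format="%B %d, %Y", errors="coerce"
)
# Non-numeric entries become NaN.
sample["PPC"] = pd.to_numeric(sample["PPC"], errors="coerce")

# Count how many entries were silently coerced.
n_bad_dates = int(sample["Last date of polling"].isna().sum())
n_bad_values = int(sample["PPC"].isna().sum())
print(n_bad_dates, n_bad_values)
```

Counting the coerced entries right after parsing shows how much data the cleaning step discarded without warning.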


## 2. Defining the target variables

In [10]:

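# Targets: the leading party in each poll and its share
data['Winner'] = data[party_columns].idxmax(axis=1)
data['Winner_Percentage'] = data[party_columns].max(axis=1)

# Features: the six party shares, with gaps filled by column means
# (the means are computed over the whole dataset, test period included)
X = data[party_columns].fillna(data[party_columns].mean())
y_class = data['Winner']
y_reg = data['Winner_Percentage']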
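The imputation above fills gaps with column means computed over the full dataset, which includes polls after the cutoff. One possible variant, sketched here on a made-up frame, takes the means from the training period only and applies them to every row. The names `toy`, `train_means` and `toy_filled` are illustrative and the values are invented.

```python
import pandas as pd

# Hypothetical toy data; column names follow the notebook, values are invented.
idx = pd.to_datetime(["2024-12-01", "2024-12-15", "2025-01-20", "2025-02-01"])
toy = pd.DataFrame(
    {"CPC": [44.0, 46.0, 40.0, 38.0], "PPC": [3.0, None, None, 1.0]},
    index=idx,
)
cutoff = pd.Timestamp("2025-01-13")

# Column means from the training period only (NaN values are skipped).
train_part = toy[toy.index < cutoff]
train_means = train_part.mean()

# Apply the training means to every row, including the test period.
toy_filled = toy.fillna(train_means)
print(toy_filled)
```

With this ordering, no statistic from the test period reaches the features the models are trained on.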


## 3. Temporal split (before / after 2025-01-13)

In [11]:

cutoff_date = '2025-01-13'

# Train strictly before the cutoff; polls dated on the cutoff itself go to the test set
X_train = X[X.index < cutoff_date]
X_test = X[X.index >= cutoff_date]

y_class_train = y_class[y_class.index < cutoff_date]
y_class_test = y_class[y_class.index >= cutoff_date]

y_reg_train = y_reg[y_reg.index < cutoff_date]
y_reg_test = y_reg[y_reg.index >= cutoff_date]

# Number of rows in each split
len(X_train), len(X_test)


(359, 65)
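`TimeSeriesSplit` is imported at the top of the notebook but never used. One way it could serve here, as a sketch and not part of the original workflow, is to validate hyperparameters inside the training period with expanding-window folds, so that each fold trains on earlier polls and validates on later ones. The synthetic array `X_demo` stands in for the 359 chronologically sorted rows of `X_train`.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Stand-in for the 359 chronologically sorted training rows.
X_demo = np.arange(359).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=5)
folds = list(tscv.split(X_demo))
for fold, (train_idx, val_idx) in enumerate(folds):
    # Every validation index comes after every training index.
    print(fold, len(train_idx), len(val_idx), train_idx.max() < val_idx.min())
```

The same splitter can be passed as the `cv` argument of scikit-learn's model-selection helpers, which keeps the held-out test period untouched while tuning.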

## 4. Training the Random Forest models

In [12]:

# Classifier for the leading party
rf_classifier = RandomForestClassifier(
    n_estimators=1000,
    random_state=42,
    max_depth=20,
    min_samples_split=10
)
rf_classifier.fit(X_train, y_class_train)

# Regressor for the leading party's share
rf_regressor = RandomForestRegressor(
    n_estimators=500,
    random_state=42
)
rf_regressor.fit(X_train, y_reg_train)


## 5. Predictions and evaluation

In [13]:

y_class_pred = rf_classifier.predict(X_test)
y_reg_pred = rf_regressor.predict(X_test)

print("=== CLASSIFICATION ===")
print("Accuracy:", accuracy_score(y_class_test, y_class_pred))
print(classification_report(y_class_test, y_class_pred))

print("\n=== REGRESSION ===")
print("MSE:", mean_squared_error(y_reg_test, y_reg_pred))
print("R²:", r2_score(y_reg_test, y_reg_pred))


=== CLASSIFICATION ===
Accuracy: 0.6923076923076923
              precision    recall  f1-score   support

         CPC       0.68      1.00      0.81        43
         LPC       1.00      0.09      0.17        22

    accuracy                           0.69        65
   macro avg       0.84      0.55      0.49        65
weighted avg       0.79      0.69      0.59        65


=== REGRESSION ===
MSE: 9.964087606153896
R²: 0.15625530847450664
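The classification report pins down the confusion matrix: CPC recall 1.00 on 43 polls means all 43 were labelled correctly, and LPC recall 0.09 on 22 polls corresponds to 2 correct predictions. The sketch below rebuilds the label vectors from those counts (inferred from the printed report, not read from the notebook's variables) and compares the accuracy with an always-CPC baseline. The helper `accuracy` is a hypothetical stand-in for `accuracy_score`.

```python
# Counts inferred from the printed report (support and recall per class).
y_true = ["CPC"] * 43 + ["LPC"] * 22
y_pred = ["CPC"] * 43 + ["LPC"] * 2 + ["CPC"] * 20

def accuracy(truth, pred):
    """Share of positions where the prediction matches the true label."""
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)

model_acc = accuracy(y_true, y_pred)
baseline_acc = accuracy(y_true, ["CPC"] * len(y_true))  # always predict CPC

print(f"model:    {model_acc:.4f}")
print(f"baseline: {baseline_acc:.4f}")
```

On these counts the classifier gets 45 of 65 polls right, while always predicting CPC gets 43 of 65, so the forest improves on the majority-class baseline by two polls.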


## 6. Comparing actual and predicted results

In [14]:

comparison = pd.DataFrame({
    'Winner_Real': y_class_test,
    'Winner_Predicted': y_class_pred,
    'Percentage_Real': y_reg_test,
    'Percentage_Predicted': y_reg_pred
})

comparison.head()


| Last date of polling | Winner_Real | Winner_Predicted | Percentage_Real | Percentage_Predicted |
|---|---|---|---|---|
| 2025-01-13 | CPC | CPC | 47.0 | 46.9456 |
| 2025-01-14 | CPC | CPC | 46.0 | 45.9372 |
| 2025-01-15 | CPC | CPC | 45.0 | 45.0104 |
| 2025-01-16 | CPC | CPC | 39.0 | 39.0008 |
| 2025-01-17 | CPC | CPC | 45.2 | 45.126 |

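A per-row error column makes the comparison easier to scan than two raw columns side by side. The sketch below recreates the five rows shown in the output and adds two illustrative columns, `abs_error` and `winner_match`; neither name appears in the original notebook.

```python
import pandas as pd

# The five rows shown in the comparison output.
rows = pd.DataFrame(
    {
        "Winner_Real": ["CPC"] * 5,
        "Winner_Predicted": ["CPC"] * 5,
        "Percentage_Real": [47.0, 46.0, 45.0, 39.0, 45.2],
        "Percentage_Predicted": [46.9456, 45.9372, 45.0104, 39.0008, 45.126],
    },
    index=pd.to_datetime(
        ["2025-01-13", "2025-01-14", "2025-01-15", "2025-01-16", "2025-01-17"]
    ),
)

# Absolute error of the predicted share and agreement on the leading party.
rows["abs_error"] = (rows["Percentage_Real"] - rows["Percentage_Predicted"]).abs()
rows["winner_match"] = rows["Winner_Real"] == rows["Winner_Predicted"]
print(rows[["abs_error", "winner_match"]])
```

On these five rows the error stays below 0.1 percentage points, so the overall MSE of about 9.96 must come from later test polls. Sorting the full `comparison` frame by such a column would locate them.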

In [15]:

comparison.to_csv('comparison_real_vs_predicted.csv')



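## Conclusion
The Random Forest models were trained on polls before 2025-01-13 (359 rows) and evaluated on polls from that date onward (65 rows), which preserves the temporal order of the data.
On the test period the classifier reaches an accuracy of 0.69 but labels almost every poll as CPC (LPC recall 0.09), and the regressor explains little of the variance in the leading party's share (R² ≈ 0.16, MSE ≈ 9.96).
Both targets are computed directly from the six party columns used as features, so these scores measure how well the forests recover the leading party and its share from a poll's own numbers, not a forecast made from earlier polls.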
