# Classification using linear models

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from sklearn import linear_model
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Annual power consumption data

## Case description / business understanding 

Electric utility companies know little about their customers. The only information available for every customer is their consumption and address. In addition, some utility companies use web portals to engage some of their customers and to collect additional information.


Some example questions that utility companies want to answer with the help of this data:

* Who are the customers using the efficiency portal?
* Can we learn from the information provided on the portal? Can we predict this information for other customers?
* What are the typical customers?

We use a dataset that consists of two combined parts:

1. Yearly electricity consumption (address and consumption in kWh).
2. Data collected via an efficiency web application (detailed household information and customer activity on the portal).


In [None]:
apc = pd.read_csv('data/APC-dataset-anonym.csv', sep=';')
apc.head(5)
#apc.shape

|Variable          | Description|
|------------------|------------|
|ID                |unique customer ID |
|PLZ               |address information: postal code|
|City              |address information: city|
|Strasse           |address information: street|
|Betreff           |miscellaneous information about the meter and the housing|
|Cons_2011, Cons_2012, Cons_2013 |consumption in kWh per year|
|Days_2011, Days_2012, Days_2013 |number of days in the year over which the consumption was recorded|
|FilterNonHousehold |a filter created by the utility; the company is not sure if it covers all non-households|
|Portal            |indicates whether the customer uses the energy efficiency portal|
|pPoints           |points on the portal|
|pEarnedPoints     |earned points on the portal|
|pHouseholdType    |type of housing |
|pMainHeatingType  |the main heating type of the household | 
|pWaterHeatingType |the type of water heating|
|pLivingAreaM2     |the living area of the household|
|pHouseholdMembers |the number of people living in the household|
|pDateCreated      |timestamp of account creation|
|pLastVisited      |timestamp of the last visit|



In [None]:
apc.describe(include='all')

In [None]:
apc.isnull().sum()

In [None]:
apc.dtypes

## Data preparation - exercises
1. Inspect the output of the previous cells. Which values in the data are problematic, and how could we handle them?
2. Use the function `value_counts()` to see the distribution of the column `Portal`.
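
As a starting point for both exercises, the following sketch shows the relevant pandas calls on a small invented table. The DataFrame `toy` and all of its values are hypothetical stand-ins for `apc`, and dropping rows is only one possible way to treat missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the `apc` table; all values are invented.
toy = pd.DataFrame({
    "Portal": [0, 0, 1, 0, 1, 0],
    "Cons_2013": [3200.0, np.nan, 4100.0, 0.0, 2800.0, 150000.0],
    "pLivingAreaM2": [np.nan, np.nan, 95.0, np.nan, 120.0, np.nan],
})

# Distribution of the target column, as absolute counts and as shares
portal_counts = toy["Portal"].value_counts()
portal_share = toy["Portal"].value_counts(normalize=True)
print(portal_counts)
print(portal_share)

# Number of missing values per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# One possible treatment (an assumption, not the only option):
# drop rows that have no consumption value at all
toy_clean = toy.dropna(subset=["Cons_2013"])
print(len(toy), "->", len(toy_clean), "rows")
```

Note that the portal columns are mostly empty for customers who do not use the portal, so dropping every row with any missing value would remove most of the data.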

In [None]:
# identify problems here

## A linear regression model for portal usage
Refer to the script from the last session to see how a linear regression model is trained with the `scikit-learn` package.

### Exercise: Estimate regression model for portal usage
1. Try to build a model that estimates the variable `Portal` based on the code from last session.
2. Inspect the model: What can you say about its quality? You can use `np.corrcoef()` ([see documentation](https://numpy.org/doc/2.2/reference/generated/numpy.corrcoef.html)) to compute the correlation between the predicted and the true values. You can also use the function `classification_report()` from scikit-learn ([see documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)).
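
The two evaluation tools can be tried out on synthetic data first. The sketch below is not a solution for the APC data: the feature, the target, and the 0.5 cut-off are all assumptions made for illustration. It shows that `LinearRegression` returns continuous scores, so they have to be turned into 0/1 labels before `classification_report()` can be applied.

```python
import numpy as np
from sklearn import linear_model
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data (not the APC dataset): one feature, binary 0/1 target
n = 500
X = rng.normal(size=(n, 1))
y = (X[:, 0] + rng.normal(scale=1.0, size=n) > 1.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit an ordinary linear regression on the 0/1 target
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)

# Predictions are continuous scores and can fall outside [0, 1]
y_score = regr.predict(X_test)

# Correlation between predicted scores and true labels
corr = np.corrcoef(y_score, y_test)[0, 1]
print("correlation:", corr)

# classification_report() needs class labels, so threshold the scores.
# The cut-off of 0.5 is an assumption chosen for illustration.
y_pred = (y_score >= 0.5).astype(int)
print(classification_report(y_test, y_pred, zero_division=0))
```

With an imbalanced target, few scores may exceed the cut-off, in which case the recall for class 1 is low even when the correlation looks acceptable.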

In [None]:
# Create linear regression object





In [None]:
#correlation between prediction and test data

#classification statistics


## A logistic regression model 

The linear regression model is unable to predict portal users, which is unsatisfactory. Let's try a logistic regression model instead.
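
Logistic regression differs from linear regression in one key step: it passes the linear score through the logistic (sigmoid) function, so the output always lies between 0 and 1 and can be read as a class probability. A minimal sketch of that function, with arbitrarily chosen example scores:

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued linear score to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary example scores, as a linear model might produce them
scores = np.array([-4.0, 0.0, 4.0])
probs = sigmoid(scores)
print(probs)
```

A score of 0 maps to a probability of exactly 0.5, which is why 0.5 is the usual default decision boundary between the two classes.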

In [None]:
# instantiate the model (using the default parameters)
logreg = linear_model.LogisticRegression(random_state=16)

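# fit the model with data
# (assumes X_train, X_test, y_train, y_test were created in the exercise above)
logreg.fit(X_train, y_train)

# predict class labels for the test set
y_pred = logreg.predict(X_test)

# confusion matrix: rows are true classes, columns are predicted classes
cnf_matrix = confusion_matrix(y_test, y_pred)
print(cnf_matrix)

print(classification_report(y_test, y_pred))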
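
The cell above relies on `X_train`, `X_test`, `y_train` and `y_test` from the earlier exercise. For reference, here is a self-contained sketch of the whole pipeline on synthetic data. The DataFrame `demo`, the column `Cons_MWh`, and the relationship between consumption and portal use are invented for illustration and say nothing about the real APC data.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(16)

# Synthetic stand-in for the cleaned data (invented values, not the APC dataset).
# `Cons_MWh` is a hypothetical feature: yearly consumption in MWh.
n = 400
demo = pd.DataFrame({"Cons_MWh": rng.normal(3.5, 1.0, size=n)})

# Invented relationship: higher consumption makes portal use more likely
p = 1.0 / (1.0 + np.exp(-2.0 * (demo["Cons_MWh"] - 3.5)))
demo["Portal"] = (rng.random(n) < p).astype(int)

X = demo[["Cons_MWh"]]
y = demo["Portal"]

# stratify=y keeps the share of portal users equal in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=16, stratify=y
)

logreg = linear_model.LogisticRegression(random_state=16)
logreg.fit(X_train, y_train)

# Hard class labels and the resulting confusion matrix
y_pred = logreg.predict(X_test)
cnf_matrix = confusion_matrix(y_test, y_pred)
print(cnf_matrix)

# Estimated probability of class 1 (portal user) for each test row
y_proba = logreg.predict_proba(X_test)[:, 1]
print(y_proba[:5])
```

`predict_proba()` is useful when the default decision boundary of 0.5 yields too few predicted portal users, because a different cut-off can then be applied to the probabilities.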
