# Autoformat notebook example

This notebook contains code that violates the PEP 8 style guidelines. It is designed to be used with the `black` autoformatting tool to demonstrate how `black` rewrites the code for you.

**RECOMMENDED**: Make a **copy** of this notebook before you autoformat it! That way you can try different settings and compare the results.
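If you prefer to make the copy from Python rather than through the file browser, a minimal sketch could look like the following. The helper name `backup_notebook` and the `_backup` suffix are assumptions made for illustration; they are not part of the notebook or of `black`.

```python
import shutil
from pathlib import Path


def backup_notebook(path: str, suffix: str = "_backup") -> Path:
    """Copy a notebook next to the original and return the copy's path."""
    source = Path(path)
    # e.g. "demo.ipynb" becomes "demo_backup.ipynb" in the same folder
    target = source.with_name(f"{source.stem}{suffix}{source.suffix}")
    shutil.copy2(source, target)
    return target


# Hypothetical usage (the file must exist in the working directory):
# backup_notebook("03_notebook_to_format.ipynb")
```

The original file is left untouched, so you can run the formatter on the copy and compare the two.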

## Imports

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

## 1. Violations of PEP 8 in the code example

* Line length: Many lines exceed the 79-character limit recommended by PEP 8, and some exceed even a relaxed limit of around 100 characters.
* Inconsistent indentation and whitespace: There are irregular spaces between function parameters and after commas.
* No blank lines between logical sections of the code.
* Improper string formatting: The long print statement at the end should be broken into multiple lines.
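The length and whitespace problems listed above can be detected mechanically. The sketch below is a simplified illustration of that idea, not a replacement for `flake8`: the function name `find_style_issues` and the messy snippet are made up for this example, and only three kinds of issue are checked.

```python
def find_style_issues(source: str, max_length: int = 79):
    """Return (line_number, message) pairs for a few simple issues."""
    issues = []
    for number, line in enumerate(source.splitlines(), start=1):
        if len(line) > max_length:
            issues.append(
                (number, f"line too long ({len(line)} > {max_length} characters)")
            )
        if line != line.rstrip():
            if line.strip():
                issues.append((number, "trailing whitespace"))
            else:
                issues.append((number, "blank line contains whitespace"))
    return issues


# A deliberately messy, made-up snippet (not the notebook's original cell).
messy = "def add(a,b):   \n    \n    return a+b\n"
for number, message in find_style_issues(messy):
    print(f"line {number}: {message}")
```

Real linters check many more rules (for example, missing whitespace after commas), which is why the notebook relies on `flake8` rather than a hand-written check.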

In [2]:
def messy_data_analysis(
    data_frame,
    columns_to_process,
    numeric_columns,
    categorical_columns,
    target_variable,
):
    # Preprocessing
    # Copy the selection so the assignments below do not modify a view
    data_frame = data_frame[columns_to_process].copy()
    data_frame[numeric_columns] = data_frame[numeric_columns].fillna(
        data_frame[numeric_columns].mean()
    )
    data_frame[categorical_columns] = data_frame[categorical_columns].fillna(
        data_frame[categorical_columns].mode().iloc[0]
    )

    # Feature engineering
    scaler = StandardScaler()
    data_frame[numeric_columns] = scaler.fit_transform(
        data_frame[numeric_columns]
    )

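    # `sparse_output` replaces `sparse` in scikit-learn >= 1.2
    encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
    encoded_categorical = encoder.fit_transform(
        data_frame[categorical_columns]
    )
    # `get_feature_names_out` replaces the removed `get_feature_names`
    encoded_feature_names = encoder.get_feature_names_out(categorical_columns)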

    encoded_df = pd.DataFrame(
        encoded_categorical,
        columns=encoded_feature_names,
        index=data_frame.index,
    )

    processed_data = pd.concat(
        [data_frame[numeric_columns], encoded_df, data_frame[target_variable]],
        axis=1,
    )

    # Split the data
    X = processed_data.drop(columns=[target_variable])
    y = processed_data[target_variable]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Train a simple model
    model = LogisticRegression(random_state=42, max_iter=1000)
    model.fit(X_train, y_train)

    # Evaluate the model
    train_accuracy = model.score(X_train, y_train)
    test_accuracy = model.score(X_test, y_test)

    print(
        f"This function performs data preprocessing, feature engineering, and trains a logistic regression model on the given dataset. The model achieves a training accuracy of {train_accuracy:.2f} and a test accuracy of {test_accuracy:.2f}. Please note that this is a basic analysis and may not be suitable for all datasets or problems. Further optimization and model selection might be necessary for better results."
    )

    return model, train_accuracy, test_accuracy

## 2. What does flake8 tell us?

To obtain the output below, run the following in a terminal:

```bash
nbqa flake8 03_notebook_to_format.ipynb
```

Fixing all of the issues below by hand would take several iterations. Instead, we will use `black` to autoformat the notebook.

```
03_notebook_to_format.ipynb:cell_1:2:1: F401 'numpy as np' imported but unused
03_notebook_to_format.ipynb:cell_2:1:35: E231 missing whitespace after ','
03_notebook_to_format.ipynb:cell_2:1:74: E231 missing whitespace after ','
03_notebook_to_format.ipynb:cell_2:1:80: E501 line too long (117 > 79 characters)
03_notebook_to_format.ipynb:cell_2:1:118: W291 trailing whitespace
03_notebook_to_format.ipynb:cell_2:4:80: E501 line too long (104 > 79 characters)
03_notebook_to_format.ipynb:cell_2:5:80: E501 line too long (124 > 79 characters)
03_notebook_to_format.ipynb:cell_2:6:1: W293 blank line contains whitespace
03_notebook_to_format.ipynb:cell_2:9:80: E501 line too long (83 > 79 characters)
03_notebook_to_format.ipynb:cell_2:10:1: W293 blank line contains whitespace
03_notebook_to_format.ipynb:cell_2:12:80: E501 line too long (80 > 79 characters)
03_notebook_to_format.ipynb:cell_2:14:1: W293 blank line contains whitespace
03_notebook_to_format.ipynb:cell_2:15:80: E501 line too long (105 > 79 characters)
03_notebook_to_format.ipynb:cell_2:16:1: W293 blank line contains whitespace
03_notebook_to_format.ipynb:cell_2:17:80: E501 line too long (110 > 79 characters)
03_notebook_to_format.ipynb:cell_2:18:1: W293 blank line contains whitespace
03_notebook_to_format.ipynb:cell_2:22:80: E501 line too long (93 > 79 characters)
03_notebook_to_format.ipynb:cell_2:23:1: W293 blank line contains whitespace
03_notebook_to_format.ipynb:cell_2:27:1: W293 blank line contains whitespace
03_notebook_to_format.ipynb:cell_2:31:1: W293 blank line contains whitespace
03_notebook_to_format.ipynb:cell_2:32:80: E501 line too long (421 > 79 characters)
03_notebook_to_format.ipynb:cell_2:33:1: W293 blank line contains whitespace
```
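With this many messages, it can help to tally them by error code before deciding what to fix first. The sketch below parses report lines in the `file:cell:line:column: CODE message` shape shown above; the pattern and the function name `count_codes` are assumptions written for this example, and only a few lines of the report are used as sample input.

```python
import re
from collections import Counter

# Pattern for lines shaped like: file:cell_N:line:column: CODE message
REPORT_LINE = re.compile(
    r"^(?P<file>[^:]+):(?P<cell>cell_\d+):(?P<line>\d+):(?P<col>\d+): "
    r"(?P<code>[A-Z]\d+) (?P<message>.*)$"
)


def count_codes(report: str) -> Counter:
    """Count how often each error code appears in a report."""
    counts = Counter()
    for line in report.splitlines():
        match = REPORT_LINE.match(line.strip())
        if match:
            counts[match.group("code")] += 1
    return counts


# Four lines taken from the report above.
sample = """\
03_notebook_to_format.ipynb:cell_1:2:1: F401 'numpy as np' imported but unused
03_notebook_to_format.ipynb:cell_2:1:35: E231 missing whitespace after ','
03_notebook_to_format.ipynb:cell_2:1:74: E231 missing whitespace after ','
03_notebook_to_format.ipynb:cell_2:6:1: W293 blank line contains whitespace
"""
print(count_codes(sample).most_common())
```

Applied to the full report, a tally like this makes it obvious that most messages are about line length (E501) and whitespace (W291, W293), which are the kinds of issue an autoformatter is meant to handle.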

## 3. Using black to autoformat the notebook

> **Remember** to make a copy of the original notebook first.

Try the following and compare the output.

```bash
nbqa black [insert notebook name].ipynb
```

```bash
nbqa black [insert notebook name].ipynb --line-length=79
```

> What happened to the string in the final print?
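In its default style, `black` generally leaves long string literals as they are, so a message like the one in the final `print` usually has to be split by hand. One common approach, sketched below, is implicit concatenation of adjacent string literals inside parentheses. The function name `build_summary`, the shortened message, and the accuracy values are made up for illustration.

```python
def build_summary(train_accuracy: float, test_accuracy: float) -> str:
    """Build a report message from short adjacent string literals."""
    # Adjacent literals inside parentheses are joined into one string,
    # so each source line stays short while the result is unchanged.
    return (
        "The model achieves a training accuracy of "
        f"{train_accuracy:.2f} and a test accuracy of "
        f"{test_accuracy:.2f}."
    )


# Made-up accuracy values, used only to show the formatting.
print(build_summary(0.91234, 0.87))
```

Each piece that contains a placeholder needs its own `f` prefix; pieces without placeholders can stay plain strings.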