# Homework




This homework uses ipytest to test your code. To use ipytest, you first need to
install it. You can install it using pip:


In [None]:
!pip install ipytest


Then you need to run the `autoconfig` command to set up ipytest:


In [4]:
import ipytest
ipytest.autoconfig()

Now you can just run a cell with the `ipytest` cell magic to run the tests.
Here is some sample code we want to test:

In [1]:
def add(a, b):
    """This function adds two numbers."""
    return a + b

Here is the test code for this function:

In [5]:
%%ipytest

def test_add():
    assert add(1, 2) == 3
    assert add(0, 0) == 0
    assert add(-1, 1) == 0

def test_add_strings():
    assert add("a", "b") == "ab"

def test_documentation():
    assert add.__doc__ == "Add two objects"

[32m.[0m[32m.[0m[31mF[0m[31m                                                                                          [100%][0m
[31m[1m________________________________________ test_documentation ________________________________________[0m

    [94mdef[39;49;00m [92mtest_documentation[39;49;00m():[90m[39;49;00m
>       [94massert[39;49;00m add.[91m__doc__[39;49;00m == [33m"[39;49;00m[33mAdd two objects[39;49;00m[33m"[39;49;00m[90m[39;49;00m
[1m[31mE       AssertionError: assert 'This functio... two numbers.' == 'Add two objects'[0m
[1m[31mE         - Add two objects[0m
[1m[31mE         + This function adds two numbers.[0m

[1m[31m/var/folders/qn/r8_0pgj1645dn1w69vqls6cw0000gn/T/ipykernel_30233/3117177672.py[0m:10: AssertionError
[31mFAILED[0m t_f007e76f0ec84b6b900195f6afcb0423.py::[1mtest_documentation[0m - AssertionError: assert 'This functio... two numbers.' == 'Add two objects'
[31m[31m[1m1 failed[0m, [32m2 passed[0m[31m in 0.06s[0

If you have installed ipytest and run the `autoconfig` command, running the cell above runs the tests and shows results like these:

```
..F                                                                                          [100%]
============================================= FAILURES =============================================
________________________________________ test_documentation ________________________________________

    def test_documentation():
>       assert add.__doc__ == "Add two objects"
E       AssertionError: assert 'This functio... two numbers.' == 'Add two objects'
E         - Add two objects
E         + This function adds two numbers.

/var/folders/qn/r8_0pgj1645dn1w69vqls6cw0000gn/T/ipykernel_30233/3117177672.py:10: AssertionError
===================================== short test summary info ======================================
FAILED t_f007e76f0ec84b6b900195f6afcb0423.py::test_documentation - AssertionError: assert 'This functio... two numbers.' == 'Add two objects'
1 failed, 2 passed in 0.06s
```

The two periods mean that two tests passed. The 'F' means one test failed. The output shows the failed test and the reason for the failure.
The documentation string in the function is different from what the test expects. You can fix the function or the test to make the test pass.
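
As a minimal sketch of the first option, assuming you choose to change the function rather than the test, rewriting the docstring to match the expected string makes `test_documentation` pass. The checks are written as plain `assert` statements so the snippet also runs outside a notebook.

```python
def add(a, b):
    """Add two objects"""
    return a + b

# The same checks the tests make, written as plain assertions.
assert add(1, 2) == 3
assert add("a", "b") == "ab"
assert add.__doc__ == "Add two objects"
```

The other option is to leave the function alone and change the expected string in the test instead.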

# Problem 1: Log Transformer

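Write a class called 'LogTransformer' that inherits from BaseEstimator
and TransformerMixin. This transformer should apply a log(x + 1) transformation
to the input features. Include fit and transform methods. The tests check both
NumPy arrays and pandas DataFrames; for a DataFrame input, the result should
also be a DataFrame.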

When you are done, run the cell with the tests in it.

In [20]:
# put your code for LogTransformer here


Run the cell below when you are done to test the class.

In [22]:
%%ipytest
import pytest
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification


def test_has_fit_method():
    assert hasattr(LogTransformer, "fit"), "LogTransformer should have a fit method"

def test_has_transform_method():
    assert hasattr(LogTransformer, "transform"), "LogTransformer should have a transform method"

def test_log_transformer():
    X = np.array([[1, 10, 100], [2, 20, 200]])
    transformer = LogTransformer()
    X_transformed = transformer.fit_transform(X)
    assert isinstance(transformer, BaseEstimator), "Must inherit from BaseEstimator"
    assert isinstance(transformer, TransformerMixin), "Must inherit from TransformerMixin"
    assert X_transformed.shape == X.shape
    np.testing.assert_array_almost_equal(
        X_transformed, 
        np.log1p(X)
    )

def test_works_with_pandas():
    import pandas as pd
    X = pd.DataFrame([[1, 10, 100], [2, 20, 200]])
    transformer = LogTransformer()
    X_transformed = transformer.fit_transform(X)
    assert isinstance(X_transformed, pd.DataFrame)
    np.testing.assert_array_almost_equal(
        X_transformed.values, 
        np.log1p(X.values)
    )


[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m                                                                                         [100%][0m
[32m[32m[1m4 passed[0m[32m in 0.01s[0m[0m


# Problem 2: Random Column Creator

Write a class called 'RandomColumnCreator' that inherits from BaseEstimator
and TransformerMixin. This transformer should add a new column to the input
features with random values. Include fit and transform methods. The constructor
should take a parameter 'column_name', which is the name of the new column.
The test also passes a 'seed' argument, so the constructor must accept one too.
Use the numpy function `np.random.rand` to generate the random values.

When you are done, run the cell with the tests in it.

In [31]:
# put your code here

    

In [32]:
%%ipytest

import pandas as pd

def test_rcc_has_fit_method():
    assert hasattr(RandomColumnCreator, "fit"), "RandomColumnCreator should have a fit method"

def test_rcc_has_transform_method():
    assert hasattr(RandomColumnCreator, "transform"), "RandomColumnCreator should have a transform method"

def test_random_column_creator():
    X = pd.DataFrame([[1, 10, 100], [2, 20, 200]])
    rcc = RandomColumnCreator(column_name="random", seed=0)
    X_transformed = rcc.fit_transform(X)
    assert isinstance(rcc, BaseEstimator), "Must inherit from BaseEstimator"
    assert isinstance(rcc, TransformerMixin), "Must inherit from TransformerMixin"
    assert X_transformed.shape[1] == X.shape[1] + 1
    assert "random" in X_transformed.columns


[32m.[0m[32m.[0m[32m.[0m[32m                                                                                          [100%][0m
[32m[32m[1m3 passed[0m[32m in 0.01s[0m[0m


# Problem 3: Lambdas


## Write a lambda function called 'square' that takes a number and returns its square.


In [33]:
# Put your code here


In [34]:

%%ipytest
def test_square():
    assert square(5) == 25
    assert square(-3) == 9
    assert square(0) == 0


[32m.[0m[32m                                                                                            [100%][0m
[32m[32m[1m1 passed[0m[32m in 0.00s[0m[0m



## Create a lambda function called 'is_even' that takes a number and returns True if it's even, False otherwise.



In [None]:
# put your code here

In [None]:
%%ipytest
def test_is_even():
    assert is_even(4) == True
    assert is_even(7) == False
    assert is_even(0) == True



## Write a lambda function called 'concat_strings' that takes two strings and returns them concatenated with a space in between.


In [35]:
# put your code here

In [None]:
%%ipytest
def test_concat_strings():
    assert concat_strings("Hello", "World") == "Hello World"
    assert concat_strings("Python", "Lambda") == "Python Lambda"
    assert concat_strings("", "Test") == " Test"


## Write a function called add_one_to_column that:

Takes a pandas DataFrame df and a column name col as input.
Returns a DataFrame where 1 is added to each element of the specified column using the pandas .assign method and a lambda function.

In [56]:
# put your code here



In [57]:
%%ipytest

import pytest

def test_add_one_to_column():
    df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
    result = add_one_to_column(df, 'a')
    expected = pd.DataFrame({'a': [2, 3, 4], 'b': [4, 5, 6]})
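    # Values in column 'a' should be incremented by 1.
    # (assert_frame_equal has no message argument; a third positional
    # argument would be read as check_dtype.)
    pd.testing.assert_frame_equal(result, expected)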

def test_no_mutation():
    df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
    result = add_one_to_column(df, 'a')
    with pytest.raises(AssertionError):
        # The original DataFrame should not be modified, so it must differ from the result.
        pd.testing.assert_frame_equal(df, result)

def test_assign_called2(monkeypatch):
    # create a spy function to track if 'assign' is called
    assign_called = False

    def spy_assign(self, **kwargs):
        nonlocal assign_called
        assign_called = True
        return self

    df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
    monkeypatch.setattr(pd.DataFrame, 'assign', spy_assign)
    
    add_one_to_column(df, 'a')
    
    assert assign_called, "pd.DataFrame.assign should have been called."



[32m.[0m[32m.[0m[32m.[0m[32m                                                                                          [100%][0m
[32m[32m[1m3 passed[0m[32m in 0.01s[0m[0m





## Write a function called `zscore` that takes a dataframe and a column name as input and returns the z-score of the column. 

The formula for the z-score is:

$$ z = \frac{X - \mu}{\sigma} $$

where:
- $ z $ is the z-score
- $ X $ is the value
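- $ \mu $ is the mean of the column
- $ \sigma $ is the standard deviation of the column (the expected values in the test match the pandas default `.std()`, which uses `ddof=1`, the sample standard deviation)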
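
To make the formula concrete, here is a small worked sketch using only the standard library. The data is illustrative and not part of the homework. It also shows that the result depends on which standard deviation you use: dividing by N (population) or by N - 1 (sample, which is what pandas `.std()` does by default).

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative data, not from the homework
mu = statistics.mean(values)

# Population standard deviation divides by N.
sigma_population = statistics.pstdev(values)
# Sample standard deviation divides by N - 1 (the pandas `.std()` default).
sigma_sample = statistics.stdev(values)

z_population = [(x - mu) / sigma_population for x in values]
z_sample = [(x - mu) / sigma_sample for x in values]

print(z_population[-1])  # 2.0, because (9 - 5) / 2 == 2
print(z_sample[-1])      # slightly smaller, because the sample std is larger
```

Either way, the z-scores of a column always sum to zero, since the mean is subtracted from every value.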



In [59]:
# write your code here



In [63]:
%%ipytest

def test_zscore():
    df = pd.DataFrame({'a': [1, 2, 3, 4, 5]})
    result = zscore(df, 'a')
    print(result)
    expected = pd.Series([-1.2649110640673518, -0.6324555320336759, 0.0, 0.6324555320336759, 1.2649110640673518],
                         name='a')
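    # Z-score should be calculated correctly.
    # (assert_series_equal has no message argument; a third positional
    # argument would be read as check_dtype.)
    pd.testing.assert_series_equal(result, expected)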


[32m.[0m[32m                                                                                            [100%][0m
[32m[32m[1m1 passed[0m[32m in 0.01s[0m[0m


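## Write a function add_zscore_column

Write a function called `add_zscore_column` that takes a pandas DataFrame df as input and returns a DataFrame with a new column called 'city_zscore' added. The test calls the function by this name and checks only that the column exists; the column name suggests it should hold the z-score of the 'city08' column.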


When you are done, run the cell with the tests in it.

In [98]:
# put your code here

In [72]:
%%ipytest
import pandas as pd
import pytest

@pytest.fixture
def df():
    url = 'https://github.com/mattharrison/datasets/raw/master/data/vehicles.csv.zip'
    raw_data = pd.read_csv(url, dtype_backend='pyarrow')
    return raw_data

def test_add_zscore_column(df):
    result = add_zscore_column(df)
    assert 'city_zscore' in result.columns
    

[32m.[0m[32m                                                                                            [100%][0m
[32m[32m[1m1 passed[0m[32m in 0.75s[0m[0m


## Grouping, loc, and assign

Write a function called `ford_city_zscore` that:

- Takes a pandas DataFrame df as input.
- Filters the DataFrame to include only rows where the 'make' column is 'Ford'.
- Groups the DataFrame by the 'year' column.
- Calculates the mean of the 'city08' column for each group.
- Creates a new column `city_zscore` with the z-score of the aggregated 'city08' column.
- Returns the rows where the z-score is greater than 1 (using `.loc`).

When you are done, run the cell with the tests in it.



In [99]:
# put your code here


In [81]:
%%ipytest

import pandas as pd
import pytest

@pytest.fixture
def df():
    url = 'https://github.com/mattharrison/datasets/raw/master/data/vehicles.csv.zip'
    raw_data = pd.read_csv(url, dtype_backend='pyarrow')
    return raw_data

def test_ford_city_zscore(df):
    result = ford_city_zscore(df)
    display(result)
    assert result.shape[0] > 0
    assert 'city_zscore' in result.columns
    assert result['city_zscore'].min() > 1


| year | city08 | city_zscore |
|------|-----------|----------|
| 2012 | 20.089744 | 1.262842 |
| 2013 | 20.710843 | 1.539876 |
| 2014 | 20.827586 | 1.591948 |
| 2015 | 22.350649 | 2.271291 |
| 2016 | 21.465909 | 1.876664 |
| 2017 | 20.848485 | 1.601269 |
| 2018 | 20.345455 | 1.376899 |
| 2019 | 19.742268 | 1.107855 |
| 2020 | 21.888889 | 2.065328 |


[32m.[0m[32m                                                                                            [100%][0m
[32m[32m[1m1 passed[0m[32m in 0.95s[0m[0m


## Square all numbers

Write a function called `square_all_numbers` that:

- Uses `.assign` and a dictionary comprehension to create a new column for each numeric column in the DataFrame.
- The new column should be the square of the original column.
- The new column should be called 'NUM_COL_squared' where 'NUM_COL' is the name of the original column.
- Hint: Use `.select_dtypes` to select only numeric columns.
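
The pattern in the first bullet relies on two plain Python features: a dictionary comprehension builds a mapping from new names to values, and `**` unpacks that mapping into keyword arguments. Here is a minimal sketch with no pandas involved, where `collect_columns` is a hypothetical stand-in for any function that accepts keyword arguments:

```python
def collect_columns(**columns):
    """Hypothetical stand-in for a function that takes keyword arguments."""
    return dict(columns)

# Illustrative data: column name -> list of numbers.
data = {"a": [1, 2], "b": [3, 4]}

# Build one new entry per existing column, with a derived name.
squared = {
    f"{name}_squared": [value ** 2 for value in column]
    for name, column in data.items()
}

# ** unpacks the dictionary into keyword arguments.
result = collect_columns(**squared)
print(result)  # {'a_squared': [1, 4], 'b_squared': [9, 16]}
```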

When you are done, run the cell with the tests in it.

In [100]:
# put your code here


In [93]:
%%ipytest

import pandas as pd
import pytest

@pytest.fixture
def df():
    url = 'https://github.com/mattharrison/datasets/raw/master/data/vehicles.csv.zip'
    raw_data = pd.read_csv(url, dtype_backend='pyarrow')
    return raw_data

def test_square_all_numbers(df):
    result = square_all_numbers(df)
    int_squared = (df.select_dtypes(int) ** 2).rename(columns=lambda x: f'{x}_squared')
    int_cols = int_squared.columns
    pd.testing.assert_frame_equal(result[int_cols], int_squared, check_dtype=False, check_names=False)
    

[32m.[0m[32m                                                                                            [100%][0m
[32m[32m[1m1 passed[0m[32m in 0.81s[0m[0m


# Plotting

In [94]:
import pandas as pd
url = 'https://github.com/mattharrison/datasets/raw/master/data/vehicles.csv.zip'
raw_data = pd.read_csv(url, dtype_backend='pyarrow')

## Scatter Plot

Plot a scatter plot of the 'city08' column against the 'highway08' column in the DataFrame.

Plot a scatter plot of 'barrels08' against 'city08' in the DataFrame.

## Histogram

Plot the distribution of the 'city08' column using a histogram.



Plot the distribution of the average of 'city08' by 'make' using a histogram.



Plot the distribution of the 'barrels08' column using a histogram.

## Bar Plot

Plot the average of 'city08' by 'make' using a bar plot.



Plot the count of 'make' using a bar plot.



Plot the mean 'city08' by 'year' using a bar plot.

## Line Plot

Plot the mean 'city08' by 'year' for 'make' equal to 'Ford' or 'Chevrolet' using a line plot.

Plot the maximum 'city08' by 'year' for 'make' equal to 'Ford' or 'Chevrolet' using a line plot.

Plot the minimum 'city08' by 'year' for 'make' equal to 'Ford' or 'Chevrolet' using a line plot.

### Misc

In [97]:
import re

def strip_color_code(s):
    """Remove ANSI color codes from a string.

    For example, a pytest header wrapped in codes such as '[31m', '[1m'
    and '[0m' (each preceded by an escape character) should come back as
    '____ test_documentation ____'.
    """
    # match ESC, '[', any run of digits and semicolons, then 'm'
    return re.sub(r'\x1b\[[0-9;]*m', '', s)
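
As a quick usage sketch, the function can be applied to a line of colored pytest output. The sample string below is written by hand for illustration, with `\x1b` standing for the escape character that starts each color code.

```python
import re

def strip_color_code(s):
    """Remove ANSI color codes from a string."""
    return re.sub(r'\x1b\[[0-9;]*m', '', s)

# Hand-written sample resembling a colored pytest summary line.
colored = "\x1b[31m\x1b[1m1 failed\x1b[0m, \x1b[32m2 passed\x1b[0m"
plain = strip_color_code(colored)
print(plain)  # 1 failed, 2 passed
```

Text without any color codes passes through unchanged, so it is safe to run the function over output that is only partly colored.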


