Let’s say that we’re building a model to predict real estate home prices in a particular city.

We analyze the distribution of the home prices and see that the home values are skewed to the right.

Do we need to do anything or take it into consideration? If so, what should we do?

Bonus: Let’s say you see that your target distribution is heavily left-skewed instead. What do you do now?

In [None]:
Right skew is common in real estate: a small number of luxury homes stretches the upper tail of the price distribution.

🎯 Why it’s a problem:
	•	Models like linear regression assume normally distributed residuals and can be sensitive to outliers.
	•	Skew can distort the relationship between predictors and target.

🛠 What to do:
	•	Apply a log transformation to the target, for example np.log(price).
	•	This compresses large values and often stabilizes variance, which often improves model performance and interpretability.
	•	After prediction, exponentiate the output to get back to the original scale:
    predicted_price = np.exp(predicted_log_price)
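
The log-transform round trip described above can be sketched as follows. This is a minimal illustration on synthetic data, not a real housing dataset: the `sqft` feature, the coefficients used to generate `price`, and the use of `np.polyfit` as a stand-in for a linear model are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed prices (assumption: log-price is linear in square footage)
rng = np.random.default_rng(0)
n = 2000
sqft = rng.uniform(500, 4000, size=n)
price = np.exp(11.0 + 0.0004 * sqft + rng.normal(0, 0.3, size=n))

# Compare skewness before and after the log transform
skew_raw = pd.Series(price).skew()
skew_log = pd.Series(np.log(price)).skew()
print(f"skew of price:      {skew_raw:.2f}")
print(f"skew of log(price): {skew_log:.2f}")

# Fit a straight line to the log target (stand-in for any regression model)
slope, intercept = np.polyfit(sqft, np.log(price), deg=1)

# Predict on the log scale, then exponentiate to return to the original scale
predicted_log_price = intercept + slope * sqft
predicted_price = np.exp(predicted_log_price)
```

One caveat: exponentiating a log-scale prediction tends to estimate the conditional median rather than the mean of the price, so the back-transformed predictions can sit below the average price.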



organization requires the temperature readings for each day, so they ask you to interpolate the missing data.

Write a Python function using Pandas that uses linear interpolation to estimate the missing data and fill out the dataframe.

Notes:

When estimating the missing values for a city, the interpolation should only consider data from the same city.
Temperature recording issues are rare, so you can assume that there is no missing data two days in a row.
You can also assume that both the first and the last dates in your dataframe hold valid temperature data.

In [None]:
import pandas as pd

def interpolate_temperature(df: pd.DataFrame) -> pd.DataFrame:
    """
    Fill missing temperature data using linear interpolation, grouped by city.

    Assumptions:
    - No two consecutive days have missing data.
    - First and last dates for each city have valid temperature values.

    Parameters:
    - df: DataFrame with columns ['date', 'city', 'temperature']

    Returns:
    - DataFrame with missing 'temperature' filled in by linear interpolation.
    """
    # Work on a copy so the caller's DataFrame is not modified in place
    df = df.copy()
    df['date'] = pd.to_datetime(df['date'])
    df = df.sort_values(['city', 'date'])

    # Group by city and interpolate missing temperature values linearly
    df['temperature'] = df.groupby('city')['temperature'].transform(lambda group: group.interpolate(method='linear'))

    return df

🔹 1. df.groupby('city')

This splits the dataframe into groups based on unique values in the city column.
Each group now contains rows for only one city.

⸻

🔹 2. ['temperature']

Within each city group, we isolate the temperature column, the one with missing values that we want to fill in.

⸻

🔹 3. .transform(...)

The .transform() method returns a Series with the same index as the original DataFrame, which means the result can be assigned directly back to df['temperature'].

This matters because .apply() can return a result indexed by the group keys (depending on the function and the pandas version), which may not align with the original index.
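
To make the index-alignment point concrete, here is a small sketch with rows from two cities deliberately interleaved. The toy data is made up for illustration; the check only shows that the transformed Series keeps the original row labels.

```python
import pandas as pd

# Rows for cities A and B are interleaved on purpose
df = pd.DataFrame({
    "city": ["B", "A", "B", "A", "B", "A"],
    "temperature": [5.0, 10.0, None, None, 7.0, 14.0],
})

# transform() returns one value per original row, labeled with the original index
filled = df.groupby("city")["temperature"].transform(
    lambda s: s.interpolate(method="linear")
)

print(filled.index.equals(df.index))  # True
df["temperature"] = filled  # safe: labels line up row by row
print(df)
```

Each missing value is filled only from its own city's neighbors, even though the rows of the two cities alternate in the frame.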

⸻

🔹 4. lambda group: group.interpolate(method='linear')

This function is applied to the temperature column of each city’s group:
	•	group is a Series of temperatures for that city.
	•	group.interpolate(method='linear') uses linear interpolation to fill in any missing values:
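$$\text{missing value} = \frac{\text{previous temperature} + \text{next temperature}}{2}$$
(for a single missing day between two known readings, which is the case guaranteed by the assumptions above)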
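
As a quick check of the arithmetic, assuming a single missing reading between two known ones, the interpolated value should be the midpoint of its neighbors. The three-value Series below is a made-up example.

```python
import pandas as pd

# One missing day between two known readings
temps = pd.Series([10.0, None, 14.0])

# method='linear' treats the values as equally spaced and ignores the index
filled = temps.interpolate(method="linear")

# Midpoint computed by hand for comparison
midpoint = (temps.iloc[0] + temps.iloc[2]) / 2

print(filled.tolist())  # [10.0, 12.0, 14.0]
print(midpoint)         # 12.0
```

Because method='linear' assumes equal spacing, it matches day-by-day data only when every date has a row; with gaps in the dates themselves, a time-aware method would be the safer choice.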

In [None]:
data = {
    'date': ['2023-01-01', '2023-01-02', '2023-01-03',
             '2023-01-01', '2023-01-02', '2023-01-03'],
    'city': ['A', 'A', 'A', 'B', 'B', 'B'],
    'temperature': [10, None, 14, 5, None, 7]
}

df = pd.DataFrame(data)
filled_df = interpolate_temperature(df)
print(filled_df)

        date city  temperature
0 2023-01-01    A         10.0
1 2023-01-02    A         12.0
2 2023-01-03    A         14.0
3 2023-01-01    B          5.0
4 2023-01-02    B          6.0
5 2023-01-03    B          7.0
