### Using .apply()

In [166]:
import pandas as pd

In [167]:
df = pd.read_csv('kc_house_data.csv')

Taking an example from the previous assignment: there I used a for loop to add a new column showing the distance of each house from the most expensive house. The same calculation can be expressed with apply.

First, import the math functions and define the two helper functions:

In [168]:
from math import pi, sin, cos, acos

def calc_distance(loc1, loc2):
    # Convert [lat, long] from degrees to radians without modifying the inputs
    lat1, long1 = loc1[0] * pi/180, loc1[1] * pi/180
    lat2, long2 = loc2[0] * pi/180, loc2[1] * pi/180
    # Spherical law of cosines; 6371 is the Earth's mean radius in km
    return acos(sin(lat1) * sin(lat2) + cos(lat1) * cos(lat2) * cos(long2 - long1)) * 6371

def distance_between(id1, id2):
    # Return None if either ID is missing from the data
    if df[df.id == id1]['id'].count() == 0 or df[df.id == id2]['id'].count() == 0:
        return None
    # Look up [lat, long] for each house (first match if an ID appears more than once)
    house1 = [df.loc[df['id'] == id1, ['lat']].iat[0, 0], df.loc[df['id'] == id1, ['long']].iat[0, 0]]
    house2 = [df.loc[df['id'] == id2, ['lat']].iat[0, 0], df.loc[df['id'] == id2, ['long']].iat[0, 0]]
    return calc_distance(house1, house2)
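
As a quick sanity check of the formula in calc_distance (the spherical law of cosines with a 6371 km radius), here is a minimal standalone sketch. The name great_circle_km and the clamping step are my own additions for illustration, not part of the original assignment.

```python
from math import radians, sin, cos, acos

EARTH_RADIUS_KM = 6371  # same radius used in calc_distance

def great_circle_km(loc1, loc2):
    """Distance in km between two (lat, long) pairs given in degrees."""
    lat1, lon1 = radians(loc1[0]), radians(loc1[1])
    lat2, lon2 = radians(loc2[0]), radians(loc2[1])
    cos_angle = sin(lat1) * sin(lat2) + cos(lat1) * cos(lat2) * cos(lon2 - lon1)
    # Clamp to [-1, 1]: rounding can push the value just past 1 for identical points
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return acos(cos_angle) * EARTH_RADIUS_KM

# A quarter of the way around the equator: should be a quarter of the circumference
quarter = great_circle_km((0, 0), (0, 90))

# A point compared with itself: should be (very close to) zero
same_point = great_circle_km((47.5, -122.2), (47.5, -122.2))
```

The clamp matters mainly when a house is compared with itself, where the cosine can round to slightly more than 1 and acos would raise an error.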

This time I will calculate distance from the least expensive house, which has ID 3421079032.

In [169]:
df['distance_from_min'] = df.apply(lambda row: distance_between(3421079032, row['id']), axis=1)
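
distance_between filters the whole DataFrame by ID for every row, so the apply call above repeats that lookup once per house. A possible alternative, sketched here on a small made-up frame (the coordinates, the IDs, and the helper name distance_from_point_km are hypothetical), reads lat and long directly and lets NumPy do the arithmetic on whole columns:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the house data; only the column names match the real file
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'lat': [47.0, 47.0, 48.0],
    'long': [-122.0, -121.0, -122.0],
})

def distance_from_point_km(frame, ref_lat, ref_long, radius_km=6371):
    """Spherical law of cosines, applied to every row of `frame` at once."""
    lat1, lon1 = np.radians(ref_lat), np.radians(ref_long)
    lat2, lon2 = np.radians(frame['lat']), np.radians(frame['long'])
    cos_angle = np.sin(lat1) * np.sin(lat2) + np.cos(lat1) * np.cos(lat2) * np.cos(lon2 - lon1)
    # Clip guards against rounding pushing the value outside the domain of arccos
    return np.arccos(cos_angle.clip(-1, 1)) * radius_km

# Use the first row as the reference house
ref = toy.iloc[0]
toy['distance_from_ref'] = distance_from_point_km(toy, ref['lat'], ref['long'])
```

I have not timed this against the real data, but it avoids the per-row ID lookups entirely.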

Check that it worked:

In [170]:
df[['id', 'price', 'distance_from_min']].sort_values('distance_from_min', ascending=False)

,id,price,distance_from_min
6636,226039316,941500.0,67.574971
18025,7280300375,536000.0,67.554172
306,7280300196,550000.0,67.471579
4537,7280300042,650000.0,67.439000
11728,7154200070,995000.0,67.398895
...,...,...,...
3852,3221079055,367000.0,2.275307
8707,2821079081,590000.0,1.986847
13137,3321079060,378000.0,1.550195
1389,4102000075,275000.0,0.344853


At a glance, this suggests that price tends to decrease as distance to the cheapest house decreases.

### Using .apply() to bin values

Continuing the above example, I will add a flag that separates distances into greater than 34 km or not (34 km is roughly half of the greatest distance, about 67.6 km). This will help me understand whether the tendency for similarly priced houses to cluster together holds across the dataset as a whole.

In [171]:
df['over_34_km_from_min'] = df.apply(lambda row: 1 if row['distance_from_min'] > 34 else 0, axis=1)
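
Because the flag depends on a single column, the comparison can also be applied to the whole column at once. A small sketch on made-up distances (the toy frame and its column names flag_apply and flag_vectorized are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({'distance_from_min': [2.3, 34.0, 34.1, 67.5]})

# Row-wise version, as in the cell above
toy['flag_apply'] = toy.apply(lambda row: 1 if row['distance_from_min'] > 34 else 0, axis=1)

# Column-wise version: boolean comparison converted to 0/1
toy['flag_vectorized'] = (toy['distance_from_min'] > 34).astype(int)
```

Both columns should hold the same flags; note that a distance of exactly 34 falls into the 0 group in either version.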

In [172]:
df_ = df[['id', 'price', 'distance_from_min', 'over_34_km_from_min']]
df_

,id,price,distance_from_min,over_34_km_from_min
0,7129300520,221900.0,38.144715,1
1,6414100192,538000.0,59.559782,1
2,5631500400,180000.0,58.159586,1
3,2487200875,604000.0,46.481765,1
4,1954400510,510000.0,40.620019,1
...,...,...,...,...
21608,263000018,360000.0,58.637113,1
21609,6600060120,400000.0,43.956883,1
21610,1523300141,402101.0,47.173348,1
21611,291310100,400000.0,32.504629,0


Now I can filter by the two categories and check the mean house price in each. I expect the mean price of houses flagged 1 in 'over_34_km_from_min' to be higher.

In [173]:
df_.loc[df_['over_34_km_from_min'] == 1, 'price'].mean().round(0)

607148.0

In [174]:
df_.loc[df_['over_34_km_from_min'] == 0, 'price'].mean().round(0)

367246.0

Conclusion: If we take the cheapest house, and divide our remaining houses into two categories (taking the halfway distance as the boundary), houses further away than the halfway distance have a higher mean price of 607,148 compared to 367,246 for houses closer to the cheapest house.
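
The two filtered means can also be computed in one step with groupby. A sketch on made-up prices (the values are hypothetical; only the column names follow the notebook):

```python
import pandas as pd

toy = pd.DataFrame({
    'price': [200000.0, 300000.0, 500000.0, 700000.0],
    'over_34_km_from_min': [0, 0, 1, 1],
})

# Mean price for each flag value, indexed by the flag
mean_price_by_flag = toy.groupby('over_34_km_from_min')['price'].mean()
```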

I am curious how many houses were in each category. I will check this:

In [175]:
df_.loc[df_['over_34_km_from_min'] == 1, 'id'].count()

15580

In [176]:
df_.loc[df_['over_34_km_from_min'] == 0, 'id'].count()

6033

It seems that most houses were in the over 34km category. Even though this was the halfway distance, there were more houses in the outer half. This makes sense in a circular settlement where the area gets larger as you move out. This leads to the next section.
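
As a rough check on the area argument above, here is a small calculation. It assumes houses are spread evenly over a full disk centred on the cheapest house, with an outer radius of twice the boundary distance. Neither assumption is verified against the data, so treat it as a sketch only.

```python
from math import pi

boundary_km = 34             # inner radius, from the flag above
outer_km = 2 * boundary_km   # assumed outer radius (the greatest distance is about 67.6 km)

inner_area = pi * boundary_km ** 2
ring_area = pi * outer_km ** 2 - inner_area

# The outer ring has three times the area of the inner disk, whatever the radius
area_ratio = ring_area / inner_area

# Ratio of the house counts from the two cells above
observed_ratio = 15580 / 6033
```

Under those assumptions the outer ring would hold about three times as many houses as the inner disk. The observed ratio of roughly 2.58 points the same way, though the real settlement is clearly not a uniform disk.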

### Binning differently

Another way to bin the houses in the above example is to split them by count: the half closest to the cheapest house versus the half farthest from it, irrespective of the boundary distance. To do this, I will first sort the houses by distance from the cheapest house and reset the index.

In [177]:
df_2 = df[['id', 'price', 'distance_from_min']].sort_values('distance_from_min', ascending=False).reset_index(drop=True)
df_2

,id,price,distance_from_min
0,226039316,941500.0,67.574971
1,7280300375,536000.0,67.554172
2,7280300196,550000.0,67.471579
3,7280300042,650000.0,67.439000
4,7154200070,995000.0,67.398895
...,...,...,...
21608,3221079055,367000.0,2.275307
21609,2821079081,590000.0,1.986847
21610,3321079060,378000.0,1.550195
21611,4102000075,275000.0,0.344853


In [178]:
median_index = len(df_2) / 2
median_index

10806.5

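Finally, after much searching, I found out how to reference a row's index label inside apply: row.name! I tried row.index so many times, and could never figure out why it didn't work. (With axis=1, each row is passed to the function as a Series, so row.index holds the column labels, while row.name holds the row's index label.)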

In [179]:
df_2['farther_half_of_houses'] = df_2.apply(lambda row: 1 if row.name < 10806 else 0, axis=1)
df_2

,id,price,distance_from_min,farther_half_of_houses
0,226039316,941500.0,67.574971,1
1,7280300375,536000.0,67.554172,1
2,7280300196,550000.0,67.471579,1
3,7280300042,650000.0,67.439000,1
4,7154200070,995000.0,67.398895,1
...,...,...,...,...
21608,3221079055,367000.0,2.275307,0
21609,2821079081,590000.0,1.986847,0
21610,3321079060,378000.0,1.550195,0
21611,4102000075,275000.0,0.344853,0
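
The difference between row.name and row.index inside apply with axis=1 can be seen on a tiny made-up frame (the toy data is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({'id': [101, 102, 103], 'price': [1.0, 2.0, 3.0]})

# With axis=1, each row arrives as a Series named after its index label
names = toy.apply(lambda row: row.name, axis=1)

# row.index is the index of that Series, i.e. the column labels, the same for every row
labels = toy.apply(lambda row: ', '.join(row.index), axis=1)
```

Here names holds the row labels 0, 1, 2, while labels holds the string 'id, price' three times.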


I expect there to be an equal number of houses flagged 0 and 1 this time.

In [180]:
df_2.loc[df_2['farther_half_of_houses'] == 1, 'id'].count()

10806

In [181]:
df_2.loc[df_2['farther_half_of_houses'] == 0, 'id'].count()

10807

Very good: the two halves differ by one house only because the total number of houses (21,613) is odd. Now I will compare means. I expect the farther half (flagged 1) to have a higher mean price.

In [182]:
df_2.loc[df_2['farther_half_of_houses'] == 1, 'price'].mean().round(0)

621581.0

In [183]:
df_2.loc[df_2['farther_half_of_houses'] == 0, 'price'].mean().round(0)

458791.0
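
The split by count does not have to rely on row position. One possible alternative, sketched on made-up values (the toy frame is hypothetical), compares each distance with the median distance, so no sorting or index reset is needed:

```python
import pandas as pd

toy = pd.DataFrame({
    'price': [900.0, 700.0, 500.0, 300.0, 100.0],
    'distance_from_min': [60.0, 50.0, 40.0, 20.0, 10.0],
})

# Flag houses strictly farther away than the median distance
median_distance = toy['distance_from_min'].median()
toy['farther_half_of_houses'] = (toy['distance_from_min'] > median_distance).astype(int)

# Mean price in each half
mean_price_by_half = toy.groupby('farther_half_of_houses')['price'].mean()
```

With an odd number of rows the house at the median lands in the 0 group, and if several houses share the median distance the two groups may not be the same size.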

Excellent. Observations confirmed. Now to get some sleep!