# German Ebay Car Sales 2016

The dataset contains listings of used cars in Germany (20 attributes, 371,528 examples). These cars were submitted to the website 'eBay Kleinanzeigen' and were crawled between 2016-03-05 and 2016-04-07.

The dataset can be found here:
https://www.kaggle.com/orgesleka/used-cars-database

### Importing a ton of libraries

If you are running this for the first time, you will need to install a lot of packages for the libraries that follow. You can run the commands below in your console; be aware that some may take a while. The `pipwin` commands are meant for Windows.

In [None]:
pip install wheel
pip install squarify
pip install pygal
pip install pywaffle
pip install pipwin
pip install plotly
pipwin install numpy
pipwin install pandas
pipwin install shapely
pipwin install gdal
pipwin install fiona
pipwin install pyproj
pipwin install six
pipwin install rtree
pipwin install descartes
pipwin install geopandas

In [170]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os
import seaborn as sns
from sklearn import preprocessing
import squarify
import geopandas as gp
import shapely
import fiona
import pygal
from pywaffle import Waffle
import plotly.express as px

%matplotlib inline
plt.rcParams['figure.figsize'] = (20, 10)
saved_style_state = matplotlib.rcParams.copy()


### Getting the Files
This is just a short cell to load the CSV file and the map shapefile. Both files need to be in the working directory (the OpenDataProject directory); if one is missing, the cell stops with an error.


In [171]:
#Getting the Car data File
filepath = "autos.csv"
cardata= pd.read_csv(filepath)
print("Successfully Loaded Car CSV")

#Getting the Map Files
shp_file_name = "plz-gebiete.shp"
germanburbs = gp.GeoDataFrame.from_file(shp_file_name)
print("Successfully Loaded Map Shape File")

Successfully Loaded Car CSV
Successfully Loaded Map Shape File
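
The cell above lets pandas raise an error if a file is missing. Here is a minimal sketch of a gentler check that prints "File not found!" instead. The helper name `load_csv_or_report` and the demo file name are made up for illustration and are not part of the notebook.

```python
import os
import pandas as pd

def load_csv_or_report(path):
    """Hypothetical helper: return a DataFrame, or None if the file is missing."""
    if not os.path.exists(path):
        print("File not found!")
        return None
    return pd.read_csv(path)

# A path that does not exist yields None instead of raising an exception
missing = load_csv_or_report("no_such_file_for_this_demo.csv")
```

With the real data this would be called as `load_csv_or_report("autos.csv")`, followed by a check for `None` before carrying on.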


Here is a view of the dataset. It was created using a web scraper on eBay Germany. I found an explanation of the columns online, as there were a couple of odd entries in German and some odd columns, including 'abtest'.

- dateCrawled         : when the advert was first crawled; all field values are taken from this date
- name                : headline that the owner of the car gave to the advert
- seller              : 'privat'(ger)/'private'(en) or 'gewerblich'(ger)/'dealer'(en)
- offerType           : 'Angebot'(ger)/'offer'(en) or 'Gesuch'(ger)/'request'(en)
- price               : the price on the advert to sell the car
- abtest              : eBay-internal variable (reasoning in the dataset's discussion section)
- vehicleType         : one of eight vehicle categories
- yearOfRegistration  : the year in which the car was first registered
- gearbox             : 'manuell'(ger)/'manual'(en) or 'automatik'(ger)/'automatic'(en)
- powerPS             : the power of the car in PS
- model               : the car's model
- kilometer           : how many kilometres the car has driven
- monthOfRegistration : the month in which the car was first registered
- fuelType            : one of seven fuel categories
- brand               : the car's brand
- notRepairedDamage   : whether the car has damage that is not repaired yet
- dateCreated         : the date on which the advert at 'eBay Kleinanzeigen' was created
- nrOfPictures        : number of pictures in the advert
- postalCode          : where in Germany the car is located
- lastSeenOnline      : when the crawler last saw this advert online
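
Since several columns hold German category values, it can help to translate them before plotting. This is a small sketch on made-up rows, using only the German/English pairs from the list above; the names `translations` and `sample` are just for this example.

```python
import pandas as pd

# German category values and their English equivalents, as listed above
translations = {
    "seller": {"privat": "private", "gewerblich": "dealer"},
    "offerType": {"Angebot": "offer", "Gesuch": "request"},
    "gearbox": {"manuell": "manual", "automatik": "automatic"},
}

# Toy rows standing in for the real listings (made-up values)
sample = pd.DataFrame({
    "seller": ["privat", "gewerblich"],
    "offerType": ["Angebot", "Gesuch"],
    "gearbox": ["manuell", "automatik"],
})

# replace() with a nested dict swaps values column by column
# and leaves anything not listed untouched
translated = sample.replace(translations)
print(translated)
```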



In [None]:
cardata.head(10)

## Simple Car Sales Comparisons

Below is some initial analysis of simple characteristics provided in the German car data CSV.

In [None]:
df = cardata["gearbox"]

man_num = df.str.count("manuell").sum()
aut_num = df.str.count("automatik").sum()

# The index labels must be in the same order as the counts
pie1 = pd.DataFrame({'Car Gearbox Types': [int(man_num), int(aut_num)]},
                  index=['Manual Cars', 'Automatic Cars'])
plot = pie1.plot.pie(y='Car Gearbox Types', figsize=(5, 5))
plt.title("Car Gearbox Types", fontsize=20)
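
Counting each category by hand means the labels have to be kept in the same order as the numbers. A small sketch of an alternative with `value_counts()`, where each count carries its own label, shown on a made-up gearbox column:

```python
import pandas as pd

# Toy gearbox column (made-up values, including a missing entry)
gearbox = pd.Series(["manuell", "automatik", "manuell", None, "manuell"])

# value_counts() skips missing values and returns counts indexed by category
gear_counts = gearbox.value_counts()

# Rename the German categories to the labels used in the chart
gear_counts = gear_counts.rename(
    index={"manuell": "Manual Cars", "automatik": "Automatic Cars"}
)
print(gear_counts)
```

The resulting Series can be plotted directly, for example with `gear_counts.plot.pie(figsize=(5, 5))`.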

In [None]:
df = cardata["fuelType"]

diesel_num = df.str.count("diesel").sum()
petrol_num = df.str.count("benzin").sum()


pie1 = pd.DataFrame({'Car Fuel Types': [int(diesel_num), int(petrol_num)]},
                  index=['Diesel Cars', 'Petrol Cars'])
plot = pie1.plot.pie(y='Car Fuel Types', figsize=(5, 5))
plt.title("Car Fuel Type", fontsize=20)

In [None]:
# Keep only plausible registration years (combine both conditions in one mask)
years = cardata["yearOfRegistration"]
years[(years > 1990) & (years < 2020)].hist()

plt.xticks(fontsize=10, rotation=90)
plt.title("Car Manufacture Year", fontsize=20)
plt.xlabel('\n Year of Manufacture', fontsize=15)
plt.ylabel('Quantity of Cars', fontsize=15, rotation='vertical', ha='right')
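
The year filter above can also be written with `Series.between`, which is inclusive on both ends by default. A small sketch on made-up years (the real column contains some implausible values too, which is why the filter is needed):

```python
import pandas as pd

# Toy registration years, including some implausible ones
years = pd.Series([1950, 1991, 2005, 2019, 2020, 9999])

# between() is inclusive, so 1991..2019 matches "> 1990 and < 2020" for whole years
plausible = years[years.between(1991, 2019)]
print(plausible.tolist())
```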



In [None]:
cardata["price"][cardata["price"] <30000 ].hist()

plt.xticks(fontsize=10, rotation=90)
plt.title("Car Prices", fontsize=20 )
plt.xlabel('\n Price', fontsize=15 )
plt.ylabel('Quantity of Cars', fontsize=15, rotation='vertical', ha='right')

In [None]:
cardata["vehicleType"].value_counts().plot(kind="bar")

plt.xticks(fontsize=10, rotation=90)
plt.title("Car Types", fontsize=20 )
plt.xlabel('\n Car Types in German', fontsize=15 )
plt.ylabel('Quantity of Cars', fontsize=15, rotation='vertical', ha='right')

In [None]:
cardata["model"].value_counts().plot(kind="bar")

plt.xticks(fontsize=10, rotation=90)
plt.title("Car Model Frequency", fontsize=20 )
plt.xlabel('\n Models', fontsize=15 )
plt.ylabel('Quantity of Cars', fontsize=15, rotation='vertical', ha='right')

In [None]:
cardata["powerPS"][cardata["powerPS"] <400 ].hist()

plt.xticks(fontsize=10, rotation=90)
plt.title("Car Power", fontsize=20 )
plt.xlabel('\n Car Horsepower', fontsize=15)
plt.ylabel('Quantity of Cars', fontsize=15, rotation='vertical', ha='right')

In [None]:
cardata["brand"].value_counts().plot(kind="bar")

plt.xticks(fontsize=10, rotation=90)
plt.title("Car Manufacturers", fontsize=20 )
plt.xlabel('\n Manufacturer', fontsize=15)
plt.ylabel('Quantity of Cars', fontsize=15, rotation='vertical', ha='right')

In [None]:
# Count the listings per vehicle type for the waffle chart
type_counts = cardata["vehicleType"].value_counts()
data = type_counts.to_dict()
fig = plt.figure(FigureClass=Waffle,
                 figsize=(18,8),
                 rows=15,
                 columns=25,
                 values=list(data.values()),
                 title={'label': 'Car Types',
                        'loc': 'left',
                        'fontdict': {'fontsize': 20}},
                 labels=["{} ({:,})".format(k, v) for k, v in data.items()],
                 legend={'loc': 'lower left',
                         'bbox_to_anchor': (0, -0.4),
                         'ncol': 6,
                         'framealpha': 0,
                         'fontsize': 12})

## Car Brand Analysis

I found the car brands particularly interesting, so in this section I filtered the brand data from the original CSV into a new dataset and made some graphs and comparisons from it.

I am going to make a graph of car sales by the manufacturers' parent companies. Below is a list of parent companies and their corresponding car brands.

- toyota : lexus, daihatsu, toyota
- general motors : chevrolet, buick, cadillac, holden, hsv
- volkswagen : bentley, skoda, audi, lamborghini, bugatti, porsche, volkswagen
- fiat chrysler automobiles : jeep, fiat, dodge, abarth, lancia, alfa_romeo, chrysler
- psa group : peugeot, citroen, opel, vauxhall
- daimler : mercedes_benz, smart, maybach
- bmw : bmw, mini, rolls_royce
- group renault : nissan, infiniti, mitsubishi, renault
- tata motors : land_rover, jaguar
- hyundai : hyundai, kia, genesis
- geely : volvo, lotus
- fuji heavy industries : subaru
- independents : everyone else
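
A list like this maps naturally onto a dictionary from brand to parent company. Here is a minimal sketch on made-up listings, using only a small excerpt of the mapping; the names `parent_of` and `brands` are just for this example.

```python
import pandas as pd

# A small excerpt of the mapping above (only three parent companies shown)
parent_of = {
    "audi": "Volkswagen", "skoda": "Volkswagen", "volkswagen": "Volkswagen",
    "bmw": "BMW", "mini": "BMW",
    "smart": "Daimler",
}

# Toy brand column standing in for cardata["brand"] (made-up values)
brands = pd.Series(["audi", "bmw", "mini", "ford", "volkswagen", "smart", "audi"])

# Brands that are not in the mapping become NaN and are then labelled as independent
parents = brands.map(parent_of).fillna("Independents")
parent_totals = parents.value_counts()
print(parent_totals)
```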

In [None]:
# Helper functions for counting listings per brand and per parent company
def brandcounter(brand_name):
    """Count the listings whose brand is exactly brand_name."""
    return int((cardata["brand"] == brand_name).sum())

def parentcounterlists(listofbrands):
    """Total number of listings across all brands of one parent company."""
    return sum(brandcounter(b) for b in listofbrands)

def parentcounterlistsoflists(listlistofbrands):
    """One total per parent company, in the same order as the input."""
    return [parentcounterlists(brands) for brands in listlistofbrands]
#parent company and car lists
parent_companies = [
"Toyota", 
"General Motors", 
"Volkswagen",
"Fiat\nChrysler\nAutomobiles", 
"PSA Group", 
"Daimler",
"BMW",
"Group Renault",
"Tata Motors",
"Hyundai",
"Geely", 
"Fuji\nHeavy\nIndustries", 
"Independents" 
]

# Brand names are spelled as they appear in the "brand" column (lowercase, underscores)
parent_company_cars = [
["lexus", "daihatsu", "toyota"],
["chevrolet", "buick", "cadillac", "holden", "hsv"],
["bentley", "skoda", "audi", "lamborghini", "bugatti", "porsche", "volkswagen"],
["jeep", "fiat", "dodge", "abarth", "lancia", "alfa_romeo", "chrysler"],
["peugeot", "citroen", "opel", "vauxhall"],
["mercedes_benz", "smart", "maybach"],
["bmw", "mini", "rolls_royce"],
["nissan", "infiniti", "mitsubishi", "renault"],
["land_rover", "jaguar"],
["hyundai", "kia", "genesis"],
["volvo", "lotus"],
["subaru"]
]
#Making a new clean dataset 

# This was done by searching brands from parent_company_cars and counting them in the german car data csv

parent_count = list(parentcounterlistsoflists(parent_company_cars))

independant_count = len(cardata["brand"]) - sum(parent_count)
parent_count.append(independant_count)  
    
# Reorganising the data for another graph: one row per brand with its parent company
data = cardata["brand"].value_counts()
data1 = data.index.tolist()   # brand names
data2 = data.tolist()         # number of listings per brand

# Look-up table from each brand to its parent company
brand_to_parent = {car: parent_companies[j]
                   for j, cars in enumerate(parent_company_cars)
                   for car in cars}

# Brands without a listed parent fall into the last category of parent_companies
data3 = [brand_to_parent.get(i, parent_companies[-1]) for i in data1]

brand = pd.DataFrame({'Parent Companies': data3 , 'Quantity of Cars': data2 , 'Car Brand': data1})


In [None]:
#Finally Plotting it
df = pd.DataFrame({'Parent Companies': parent_companies , 'Quantity of Cars': parent_count})
df.sort_values('Quantity of Cars',inplace= True, ascending = False)
ax = df.plot.bar(x='Parent Companies',  rot=0, width = 0.5)
plt.xlabel('\n Parent Company', fontsize=15)
plt.ylabel('Quantity of Cars', fontsize=15, rotation='vertical', ha='right')
plt.title("Car Family Manufactures", fontsize=20 )

I found the bar chart above quite a simplistic way to show this data, so I explored some different graphs, including some cool things called treemaps :)

In [None]:
brand.head()

In [None]:
plt.figure(figsize=(20,10))
squarify.plot(sizes=parent_count,
              color=['#221A7C','#202785','#1D348E','#1B4297', '#184FA0', '#165CA9'  , '#1369B2', '#0979B9',"#0089c0", '#0095C2', '#00A2C4', '#00AEC5', \
         '#00BAC7', '#16C0C5','#2BC5C3', '#41CBC0', '#56D0BE', '#6CD6BC', '#77DABB',  '#83DEBA', '#8EE1B8',  '#9AE5B7', \
          '#A5E9B6', '#B0EDB5', '#BCF0B3', '#C7F4B2', '#CCF5B4', '#D1F7B6', '#DCFAB9', '#E6FCBD', '#F0FFC0'],
              label=parent_companies,
              pad=True)

plt.title('Treemap', fontsize=20 )
plt.axis('off');

In [None]:
data = dict(zip(parent_companies, parent_count))
fig = plt.figure(FigureClass=Waffle,
                 figsize=(18,8),
                 rows=15,
                 columns=25,
                 values=parent_count,
                 title={'label': 'Car Manufacturer by Parent Companies',
                        'loc': 'left',
                        'fontdict': {'fontsize': 20}},
                 labels=["{} ({:,})".format(k, v) for k, v in data.items()],
                 legend={'loc': 'lower left',
                         'bbox_to_anchor': (0, -0.4),
                         'ncol': 6,
                         'framealpha': 0,
                         'fontsize': 12})

In [None]:
data = dict(zip(data1,data2))
fig = plt.figure(FigureClass=Waffle,
                 figsize=(18,8),
                 rows=15,
                 columns=25,
                 values=data2,
                 title={'label': 'Car Manufacturers',
                        'loc': 'left',
                        'fontdict': {'fontsize': 20}},
                 labels=["{} ({:,})".format(k, v) for k, v in data.items()],
                 legend={'loc': 'lower left',
                         'bbox_to_anchor': (0, -0.4),
                         'ncol': 6,
                         'framealpha': 0,
                         'fontsize': 12})
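
With `rows=15` and `columns=25` the chart has 375 squares, so the counts have to be scaled down to fit. The sketch below shows one way such a scaling could work (largest-remainder rounding). It is my own illustration on made-up counts, and PyWaffle's internal rounding may well differ; `scale_to_blocks` is a hypothetical helper.

```python
def scale_to_blocks(values, total_blocks):
    """Split total_blocks among values proportionally (largest-remainder rounding)."""
    total = sum(values)
    exact = [v * total_blocks / total for v in values]
    blocks = [int(e) for e in exact]          # round everything down first
    leftover = total_blocks - sum(blocks)     # squares still to hand out
    # Give the leftover squares to the entries with the largest fractional parts
    by_remainder = sorted(range(len(values)),
                          key=lambda i: exact[i] - blocks[i],
                          reverse=True)
    for i in by_remainder[:leftover]:
        blocks[i] += 1
    return blocks

# Made-up counts for four categories on a 15 x 25 grid
blocks = scale_to_blocks([5000, 3000, 2000, 40], 15 * 25)
print(blocks)
```

Very small categories end up with one square or none, which is why the legend with the exact counts is useful next to the chart.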

In [None]:
df1 = pd.pivot_table(brand, values='Quantity of Cars', index='Parent Companies', columns='Car Brand')
df1

In [None]:
df1 = pd.pivot_table(brand, values='Quantity of Cars', index='Parent Companies', columns='Car Brand')
df1.plot(kind='bar', stacked=True).legend(bbox_to_anchor=(1.2, 1))
plt.xlabel('\n Parent Company', fontsize=15)
plt.ylabel('Quantity of Cars', fontsize=15, rotation='vertical', ha='right')
plt.title("Car Family Manufactures", fontsize=20 )
#rectangle artists


In [None]:
# I tried sorting it, but it took too long
# sort_list = brand.groupby('Parent Companies')['Quantity of Cars'].sum().sort_values(ascending = False).index
# df1 = pd.pivot_table(brand, values='Quantity of Cars', index='Parent Companies', columns='Car Brand').reindex(sort_list) 
# df1.plot(kind='bar', stacked=True).legend(bbox_to_anchor=(1.2, 1))
# plt.xlabel('\n Parent Company', fontsize=15)
# plt.ylabel('Quantity of Cars', fontsize=15, rotation='vertical', ha='right')
# plt.title("Car Family Manufactures", fontsize=20 )
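
For reference, here is a small sketch of sorting a pivot table like this by its row totals, on a made-up version of the `brand` table. The numbers are invented and `brand_demo` is just a stand-in name.

```python
import pandas as pd

# Made-up stand-in for the brand table built earlier
brand_demo = pd.DataFrame({
    "Parent Companies": ["BMW", "BMW", "Volkswagen", "Volkswagen", "Daimler"],
    "Car Brand": ["bmw", "mini", "audi", "volkswagen", "smart"],
    "Quantity of Cars": [40, 5, 30, 80, 10],
})

pivot = pd.pivot_table(brand_demo, values="Quantity of Cars",
                       index="Parent Companies", columns="Car Brand")

# Row totals (empty cells are skipped by sum), sorted from largest to smallest
order = pivot.sum(axis=1).sort_values(ascending=False).index
pivot_sorted = pivot.loc[order]
print(pivot_sorted)
```

`pivot_sorted.plot(kind='bar', stacked=True)` would then draw the stacked bars from the largest parent company to the smallest.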

In [None]:
brand.head()

In [None]:
cardata["dateCreated"].value_counts().plot(kind="bar")

# Maps

I found a map shapefile of Germany that includes postal codes, which I could link to my car data. I found the shapefile here (the site is in German, so you might want to translate it):
https://www.suche-postleitzahl.org/downloads



You probably don't want to run the following cell, as it takes a while, but it draws a massive shape of Germany made up of all the postcode areas.

In [None]:
germanburbs.geometry.plot()

In [None]:
germanburbs.head()

## Adding Car data to the Map
Here I add a count of sales and other information for each postcode. I aim to add further characteristics later, like comparing the most prominent car brands regionally.

In [None]:
#Really rough distribution of car sales via postal code
df = cardata["postalCode"]
df.hist()

In [172]:
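# This function is called from a lambda and returns the number of cars listed in a postcode
# German postcodes have five digits; zero-pad so that e.g. 1110 becomes "01110" and matches the map data
cardata["postalCode"] = cardata["postalCode"].astype(str).str.zfill(5)
counts = cardata['postalCode'].value_counts().rename_axis('postalCode').reset_index(name='count')
plist = counts['postalCode'].tolist()
# Dictionary look-ups are much faster than searching the list for every map region
count_by_pc = dict(zip(counts['postalCode'], counts['count']))
def pccounter(pcode):
    """Return the number of listings for a postcode, or 0 if there are none."""
    return count_by_pc.get(str(pcode), 0)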


In [173]:
#This is the lambda function that adds the count column
germanburbs['count'] = germanburbs.apply(lambda x: pccounter(x['plz']), axis=1)
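
The same count column can also be built without a row-by-row `apply`, by mapping the postcode column through the value counts. A small sketch with plain DataFrames and made-up postcodes (`regions` stands in for the map data, `listings` for the car data):

```python
import pandas as pd

# Made-up stand-ins for the car data and the map data
listings = pd.DataFrame({"postalCode": ["01067", "44787", "44787", "10115"]})
regions = pd.DataFrame({"plz": ["01067", "10115", "44787", "99999"]})

# Count the listings per postcode, then look each region's postcode up in that Series
counts_by_plz = listings["postalCode"].value_counts()
regions["count"] = regions["plz"].map(counts_by_plz).fillna(0).astype(int)
print(regions)
```

Regions with no listings get NaN from `map`, which `fillna(0)` turns into a zero count.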


In [176]:

# I wrote all of these out by hand, because my attempt at a single function that alters a DataFrame raised the error
# "'DataFrame' objects are mutable, thus they cannot be hashed"

# Re-sorting the data into means and modes per postcode
df = cardata
# Zero-pad to five digits so that postcodes such as 01110 keep their leading zero
df['postalCode'] = df['postalCode'].astype(str).str.zfill(5)

price = df.groupby("postalCode").agg({"price":['mean']}).reset_index()
price['price'] = price['price'].astype(int)
price.columns = price.columns.droplevel(1)

gear = df.groupby("postalCode")['gearbox'].agg(pd.Series.mode)

brand1 = df.groupby("postalCode")['brand'].agg(pd.Series.mode)

fuel = df.groupby("postalCode")['fuelType'].agg(pd.Series.mode)

vtype = df.groupby("postalCode")['vehicleType'].agg(pd.Series.mode)

power = df.groupby("postalCode").agg({"powerPS":['mean']}).reset_index()
power['powerPS'] = power['powerPS'].astype(int)
power.columns = power.columns.droplevel(1)

regoY = df.groupby("postalCode").agg({"yearOfRegistration":['mean']}).reset_index()
regoY['yearOfRegistration'] = regoY['yearOfRegistration'].astype(int)
regoY.columns = regoY.columns.droplevel(1)

regoM = df.groupby("postalCode").agg({"monthOfRegistration":['mean']}).reset_index()
regoM['monthOfRegistration'] = regoM['monthOfRegistration'].astype(int)
regoM.columns = regoM.columns.droplevel(1)

# These functions are called from lambdas and transfer the data above into the map data

counts = cardata['postalCode'].value_counts().rename_axis('postalCode').reset_index(name='count')
plist = counts['postalCode'].tolist()
def pccounter2(pcode,df,column):
    pcode = str(pcode)
    if pcode in plist:
        e = df.loc[df['postalCode'] == pcode,column].iloc[0]
        num = int(e)
        return num
    else:
        return 0
def pcmode(pcode,df1):
    pcode = str(pcode)
    if pcode in plist:
        e = df1[pcode]
        return e
    else:
        return 0

# These are some lambda functions
germanburbs['Gearbox'] = germanburbs.apply(lambda x: pcmode(x['plz'],gear), axis=1)
germanburbs['Car Brand'] = germanburbs.apply(lambda x: pcmode(x['plz'],brand1), axis=1)
germanburbs['Fuel Type'] = germanburbs.apply(lambda x: pcmode(x['plz'],fuel), axis=1)
germanburbs['Vehicle Type'] = germanburbs.apply(lambda x: pcmode(x['plz'],vtype), axis=1)
germanburbs['Car Price Av'] = germanburbs.apply(lambda x: pccounter2(x['plz'],price,'price'), axis=1)
germanburbs['Car Power Av'] = germanburbs.apply(lambda x: pccounter2(x['plz'],power,'powerPS'), axis=1)
germanburbs['Rego Year'] = germanburbs.apply(lambda x: pccounter2(x['plz'],regoY,'yearOfRegistration'), axis=1)
germanburbs['Rego Month'] = germanburbs.apply(lambda x: pccounter2(x['plz'],regoM,'monthOfRegistration'), axis=1)
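
The per-postcode means and modes can also be collected in a single `groupby` with named aggregation. This is a sketch on made-up rows; `first_mode` is a hypothetical helper that keeps only one value when there is a tie, whereas `pd.Series.mode` returns every tied value (which is where list entries such as "[bmw, skoda]" in the output come from).

```python
import pandas as pd

# Made-up listings
cars = pd.DataFrame({
    "postalCode": ["44787", "44787", "44787", "10115"],
    "price": [1000, 3000, 5000, 7000],
    "brand": ["opel", "opel", "bmw", "audi"],
})

def first_mode(s):
    """Most common value; on a tie, take the first of the sorted modes."""
    return s.mode().iloc[0]

# One row per postcode, with one output column per (input column, function) pair
summary = cars.groupby("postalCode").agg(
    price_mean=("price", "mean"),
    top_brand=("brand", first_mode),
).reset_index()
print(summary)
```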
                                        

In [177]:
germanburbs.sample(10)

Unnamed: 0,plz,note,geometry,count,Gearbox,Car Brand,Fuel Type,Vehicle Type,Car Price Av,Car Power Av,Rego Year,Rego Month
3293,71732,71732 Tamm,"POLYGON ((9.08649 48.92112, 9.08715 48.92245, ...",40,manuell,"[mercedes_benz, volkswagen]",benzin,kleinwagen,6058,112,2002,5
6200,82288,82288 Kottgeisering,"POLYGON ((11.10298 48.12622, 11.10319 48.12664...",2,manuell,"[bmw, skoda]",benzin,"[cabrio, limousine]",5899,136,2004,7
1834,49088,49088 Osnabrück,"POLYGON ((8.03624 52.29246, 8.03793 52.29238, ...",96,manuell,opel,benzin,limousine,4264,104,2003,5
4836,97490,97490 Poppenhausen,"POLYGON ((10.07086 50.10310, 10.07124 50.10403...",22,manuell,"[audi, mercedes_benz, volkswagen]",benzin,kombi,2788,121,2001,5
6310,39638,39638 Gardelegen,"POLYGON ((11.20265 52.56187, 11.20295 52.56281...",69,manuell,volkswagen,benzin,limousine,4468,110,2001,5
3308,63697,63697 Hirzenhain,"POLYGON ((9.09359 50.42780, 9.09727 50.43094, ...",15,manuell,bmw,benzin,"[kombi, limousine]",3587,142,2001,3
8421,12623,12623 Berlin Mahlsdorf,"POLYGON ((13.57863 52.48266, 13.57905 52.48386...",108,manuell,volkswagen,benzin,limousine,5484,106,2003,6
2888,75391,75391 Gechingen,"POLYGON ((8.78944 48.67646, 8.79000 48.67685, ...",19,manuell,volkswagen,benzin,kleinwagen,4602,101,2003,5
3243,31711,31711 Luhden,"POLYGON ((9.06200 52.22324, 9.06202 52.22341, ...",7,manuell,ford,benzin,kleinwagen,4520,68,2009,5
2480,78652,78652 Deißlingen,"POLYGON ((8.54282 48.11367, 8.54320 48.11481, ...",21,manuell,bmw,benzin,limousine,8572,97,2003,5


### Sorting Done
Now, from my germanburbs data, I am able to tell the average car price and the most common car type, brand, gearbox etc. for any postcode.

In [178]:
germanburbs.sample(10)

Unnamed: 0,plz,note,geometry,count,Gearbox,Car Brand,Fuel Type,Vehicle Type,Car Price Av,Car Power Av,Rego Year,Rego Month
5285,91567,91567 Herrieden,"POLYGON ((10.38095 49.19798, 10.38129 49.19813...",23,manuell,"[bmw, opel]",benzin,limousine,4954,137,2001,5
6935,93326,93326 Abensberg,"POLYGON ((11.79339 48.84403, 11.79958 48.84766...",82,manuell,audi,benzin,limousine,7318,131,2003,5
4795,73433,73433 Aalen,"POLYGON ((10.05072 48.86170, 10.05118 48.87245...",74,manuell,bmw,benzin,limousine,4955,118,2003,4
6243,91358,91358 Kunreuth,"POLYGON ((11.13436 49.68309, 11.13531 49.68310...",4,manuell,seat,benzin,limousine,9474,218,2006,6
178,54587,54587 Lissendorf,"POLYGON ((6.53603 50.31115, 6.54376 50.31623, ...",6,manuell,volkswagen,benzin,kleinwagen,3272,128,2001,4
3734,34131,34131 Kassel,"POLYGON ((9.35008 51.30687, 9.35045 51.30701, ...",49,manuell,volkswagen,benzin,kleinwagen,8997,139,2004,5
4602,89075,89075 Ulm,"POLYGON ((9.93620 48.41524, 9.93821 48.41590, ...",133,manuell,volkswagen,benzin,limousine,5721,114,2003,6
1218,58509,58509 Lüdenscheid,"POLYGON ((7.58223 51.21556, 7.58260 51.21772, ...",117,manuell,volkswagen,benzin,kleinwagen,6493,98,2004,5
4633,31135,31135 Hildesheim,"POLYGON ((9.95219 52.18412, 9.96287 52.18540, ...",106,manuell,volkswagen,benzin,limousine,3463,107,2003,7
2856,32545,32545 Bad Oeynhausen,"POLYGON ((8.76951 52.16887, 8.76956 52.16909, ...",118,manuell,opel,benzin,kleinwagen,3637,91,2000,3


## Regional Car Sales 
Below are the plots of the regional car sales analysis. There were a lot of areas, so I chose to focus on the area around the postcode with the most sales, which happened to be a place called Bochum.

In [None]:
germanburbs.plot(column='count', cmap='cool', legend=True)

In [None]:
a = germanburbs.iloc[0]
print(a)
a.geometry

In [None]:
def add_centroid(row):
    return row.geometry.centroid

germanburbs["centroid"] = germanburbs.apply(add_centroid, axis=1)

In [None]:
# I found the purple bit really interesting, so I went to inspect it further
x = germanburbs["count"].idxmax()   # index label of the postcode with the most listings
a = germanburbs.loc[x]              # look the row up by label
print(a.centroid)
a.centroid
print(a)


In [None]:
right_here = shapely.geometry.point.Point(8.270924555201304, 49.98808464340505)
germanburbs["distance_from_mainz"] = germanburbs.geometry.distance(right_here)
close_burbs = germanburbs[germanburbs.distance_from_mainz<0.2]
close_burbs.plot(column='count', cmap='cool', legend=True)
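
The polygon coordinates in the output look like longitude and latitude, so the `0.2` threshold above is presumably in degrees rather than kilometres (GeoPandas measures distance in the units of the coordinates). To get a feel for the scale, here is a pure-Python haversine sketch that converts two lon/lat points into kilometres. The helper `haversine_km` is my own, and the Frankfurt coordinates are the ones from the city list further down.

```python
import math

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two lon/lat points (spherical Earth)."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# The point used above and Frankfurt am Main, both as (longitude, latitude)
mainz = (8.270924555201304, 49.98808464340505)
frankfurt = (8.682127, 50.110924)

d_km = haversine_km(*mainz, *frankfurt)
print(round(d_km, 1))
```

At this latitude one degree of longitude is noticeably shorter than one degree of latitude, so a fixed threshold in degrees selects an area that is squashed east to west.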

In [None]:
germanburbs.distance_from_mainz.hist(bins=50);

In [None]:
right_here = shapely.geometry.point.Point(13.3846772483033, 52.53213958289162)
germanburbs["distance_from_berlinmitte"] = germanburbs.geometry.distance(right_here)
closer_burbs = germanburbs[germanburbs.distance_from_berlinmitte<0.1]
closer_burbs.plot(column='count', cmap='cool', legend=True);

In [None]:
# Widen the selection to everything within 0.5 of the same point
closer_burbs = germanburbs[germanburbs.distance_from_berlinmitte<0.5]
closer_burbs.plot(column='count', cmap='cool', legend=True);

In [None]:
# Make sure you read postal codes as strings, otherwise 
# the postal code 01110 will be parsed as the number 1110. 



plt.rcParams['figure.figsize'] = [16, 11]

# Get lat and lng of Germany's main cities. 
top_cities = {
    'Berlin': (13.404954, 52.520008), 
    'Cologne': (6.953101, 50.935173),
    'Düsseldorf': (6.782048, 51.227144),
    'Frankfurt am Main': (8.682127, 50.110924),
    'Hamburg': (9.993682, 53.551086),
    'Leipzig': (12.387772, 51.343479),
    'Munich': (11.576124, 48.137154),
    'Dortmund': (7.468554, 51.513400),
    'Stuttgart': (9.181332, 48.777128),
    'Nuremberg': (11.077438, 49.449820),
    'Hannover': (9.73322, 52.37052)
}

fig, ax = plt.subplots()

germanburbs.plot(ax=ax, color='green', alpha=0.8)

# Plot cities. 
for c in top_cities.keys():
    # Plot city name.
    ax.text(
        x=top_cities[c][0], 
        # Add small shift to avoid overlap with point.
        y=top_cities[c][1] + 0.08, 
        s=c, 
        fontsize=12,
        ha='center', 
    )
    # Plot city location centroid.
    ax.plot(
        top_cities[c][0], 
        top_cities[c][1], 
        marker='o',
        c='black', 
        alpha=0.5
    )

ax.set(
    title='Germany', 
    aspect=1.3, 
    facecolor='lightblue'
)
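
The comment at the top of this cell about reading postal codes as strings can be shown with a tiny example. The CSV text below is made up and read from memory, so no file is needed.

```python
import io
import pandas as pd

# Two made-up rows; the first postcode starts with a zero
csv_text = "postalCode,price\n01110,500\n44787,1200\n"

# Default parsing turns the column into integers and drops the leading zero
as_numbers = pd.read_csv(io.StringIO(csv_text))

# Asking for strings keeps the postcode exactly as written
as_strings = pd.read_csv(io.StringIO(csv_text), dtype={"postalCode": str})

# If the column was already parsed as integers, zero-padding restores the five digits
repaired = as_numbers["postalCode"].astype(str).str.zfill(5)

print(as_numbers["postalCode"].tolist())
print(as_strings["postalCode"].tolist())
print(repaired.tolist())
```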

Graphs to make
- Car Types
    - Coupes etc on Waffle diagram
- Map of capital cities
    - Top Cars
    - Transmission
    - Sales Price
- Population Graph
    - Normalised Car Sales
    - Population vs Car Sales

_Note: the first `in` means a different thing to the second `in`. I was wondering if I should leave this out, but it's probably good to expose you to strange stuff!_

In [None]:
# This bit makes some random data. Ignore it
mu, sigma = 100, 15; x = mu + sigma*np.random.randn(10000)

In [None]:
# the histogram of the data
plt.hist(x, 50, density=True, facecolor='green', alpha=0.75)  # `normed` was removed from newer Matplotlib versions
plt.xlabel('Smarts')
plt.ylabel('Probability')
plt.title(r'$\mathrm{Histogram\ of\ IQ:}\ \mu=100,\ \sigma=15$') # allows for latex formatting
# plt.axis([40, 160, 0, 0.03])
plt.grid(True)
plt.show()

In [None]:
# the histogram of the data
plt.hist(x, 50, density=True, facecolor='green', alpha=0.75)
plt.xlabel('Smarts')
plt.ylabel('Probability')
plt.title(r'$\mathrm{Histogram\ of\ IQ:}\ \mu=100,\ \sigma=15$') # allows for latex formatting
# plt.axis([40, 160, 0, 0.03])
plt.grid(True)
plt.show()

Here's how we made our histogram before:

And this is how we'd change it so that we can add more features:

In [None]:
# Postal codes are converted to numbers so that they can be binned
capped_face_value_data = pd.to_numeric(cardata["postalCode"])

plt.hist(capped_face_value_data)
plt.show()

Let's look at some of the things we can do to this. The docs for histograms are here: http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.hist

In [None]:
# Convert postal codes to numbers so that they can be compared and binned
postcodes = pd.to_numeric(cardata["postalCode"])
capped_face_value_data = postcodes[postcodes < 60000]

plt.hist(capped_face_value_data, bins=10, facecolor='blue', alpha=0.2) #<-old one
plt.hist(capped_face_value_data, bins=50, facecolor='green', alpha=1)  #<-new one
plt.show()
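
The `bins` argument only changes how finely the same values are grouped. Matplotlib's `hist` relies on NumPy's histogram function for the counting, so the effect can be checked without drawing anything. A sketch on random made-up values:

```python
import numpy as np

# Made-up values in the same range as the capped postcodes
rng = np.random.default_rng(0)
values = rng.integers(0, 60000, size=1000)

# Same data, two different bin counts
coarse_counts, coarse_edges = np.histogram(values, bins=10)
fine_counts, fine_edges = np.histogram(values, bins=50)

# Every value lands in exactly one bin either way
print(coarse_counts.sum(), fine_counts.sum())
```

There is always one more bin edge than there are bins, which is handy to know when passing explicit edges to `bins`.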

We can go back to our initial, unfiltered, data:

In [None]:
number_of_bins = 100
lower_bound = 0
upper_bound = 100000

plt.hist(pd.to_numeric(cardata["postalCode"]), bins=number_of_bins, range=(lower_bound, upper_bound))
plt.title("Car listings by postcode, crawled between {} and {}".format("2016-03-05", "2016-04-07"), fontsize=18)
plt.xlabel('German Postcode', fontsize=26)
plt.ylabel('Count', fontsize=26)
plt.grid(True)
plt.show()

This is some straight up, powerful voodoo.

We're grouping the listings by postal code, and then adding up the prices in each group. Pandas' `groupby` feature allows for all kinds of clever stuff like that.

In [None]:
income = cardata[["postalCode","price"]].groupby("postalCode").sum()
income.index = pd.to_numeric(income.index)  # numeric postcodes give a numeric x-axis

plt.xkcd()
plt.plot(income, "x-")
plt.title("Income from car sales", fontsize=18)
plt.xlabel('Postal Code', fontsize=26)
plt.ylabel('$ Value', fontsize=26)
plt.grid(True)
plt.show()

## The answer is _folding_

_(This is a "pattern")_

In [None]:
def fold(given):
    """Return canonical versions of inputs."""
    
    # Use canonical variables so that you can define once, use many times.
    UNSW_canonical = "uni of stairs"
    ben_name_cannonical = "Ben Doherty"

    # dictionary of input:output pairs
    folds = {
        "University of new south wales": UNSW_canonical,
        "University of New South Wales": UNSW_canonical,
        "University of NSW": UNSW_canonical,
        "UNSW": UNSW_canonical,
        "New-south": UNSW_canonical,
        "BDoh": ben_name_cannonical,
        "Benny": ben_name_cannonical,
        "Mr Dockerty": ben_name_cannonical,
        "Oi, Dickehead!": ben_name_cannonical
    }

#     return folds[given] # needs a defensive part, but omitted for clarity.
    default_value = given
    return folds.get(given, default_value)

print(fold("New-south"))
print(fold("BDoh"))

# _fin_