# Some ground rules for the assignments:

For all assignments (this one and any future assignment including the final project): 

* Do not download and save the data locally unless your data is very big (~TBs). I **do not** want to see you opening a local file for the data I gave you as a URL. Anything local is suspect: local files can't be trusted (they might be manipulated, changed, modified, or tampered with). Refer to my lecture notes on how to pull data from a URL using `urlopen`.

* All computations must be done locally within Python. Nothing external: no manual input, no Excel, no SQL, no Java, etc.

* All code has to be explained. Explain your reasoning and your choices. If you installed a third-party library (including `numpy`, `scipy`, `pandas`, etc.), explain which parts you import and what each function does.

* Explain your code using a markdown cell. **Do not** use code comments starting with `#` to do your explanations.

* Do not use `if __name__ == "__main__"`. EVER! If you are using that within jupyter, I am going to assume you found the solution on the internet and you cut/pasted without understanding what that piece of code did.
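
The first rule above can be sketched as follows: the response is read into memory and handed to pandas through `BytesIO`, so nothing touches the disk. The helper name `read_remote_csv` and the `data:` URL are illustrative assumptions; the `data:` URL only stands in for a real address so that the sketch runs offline.

```python
from io import BytesIO
from urllib.request import urlopen

import pandas as pd


def read_remote_csv(url, **read_csv_options):
    """Fetch a CSV from a URL and parse it without writing a local file."""
    # The context manager closes the connection once the bytes are read.
    with urlopen(url) as response:
        payload = response.read()
    # BytesIO wraps the in-memory bytes in a file-like object for pandas.
    return pd.read_csv(BytesIO(payload), **read_csv_options)


# Hypothetical stand-in for a real data URL: a tiny semicolon-separated CSV.
demo_url = "data:text/csv;charset=utf-8,station%3Btrips%0AA%3B3%0AB%3B5"
frame = read_remote_csv(demo_url, sep=";")
print(frame)
```

With a real address, only `demo_url` and the keyword arguments passed on to `pd.read_csv` would change.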

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import json
import networkx as nx
import seaborn as sns

from urllib.request import Request, urlopen
from urllib.parse import urlencode
from collections import Counter
from zipfile import ZipFile
from io import BytesIO
from xmltodict import parse

# Question 1

Istanbul municipality has an open data service that provides detailed information about its services. For this question, use the data given at [this link](https://data.ibb.gov.tr/dataset/istanbul-sehir-hatlari-iskeleleri).

1. Understand what the data is for. Explain what it is, what it records, what pieces it has in broad strokes. 

2. The data contains, among other things, the geographical locations of the sea stations ('Iskele') of Istanbul Deniz Isletmeleri boats operating in Istanbul. Extract the locations of these stations as a pandas dataframe with 2 columns: latitudes and longitudes. The index of the dataframe has to be the station names.

The data is in XML format. Do not use any external libraries other than numpy, pandas, and `xmltodict`. Use `xmltodict` to convert the data into a dictionary, then extract the necessary parts.
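
To get a broad picture of a deeply nested dictionary such as the one `xmltodict` returns, it can help to list only its keys. The sketch below is a hedged illustration: the helper `outline` and the `sample` dictionary are hypothetical, hand-written to mimic the general shape of a parsed KML document rather than taken from the real file.

```python
def outline(node, depth=0, max_depth=3):
    """Return the indented key names of a nested dict/list structure."""
    lines = []
    if depth > max_depth:
        return lines
    if isinstance(node, dict):
        for key, value in node.items():
            lines.append("  " * depth + str(key))
            lines.extend(outline(value, depth + 1, max_depth))
    elif isinstance(node, list) and node:
        # Sibling elements usually share one layout, so inspect only the first.
        lines.extend(outline(node[0], depth, max_depth))
    return lines


# Hypothetical miniature of a parsed KML document.
sample = {
    "kml": {
        "Document": {
            "name": "example.kml",
            "Folder": {"Folder": [{"name": "group", "Placemark": [{"name": "A"}]}]},
        }
    }
}
print("\n".join(outline(sample)))
```

Raising `max_depth` reveals deeper levels one step at a time, which is easier to read than the full dump.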

In [3]:
with urlopen('https://data.ibb.gov.tr/dataset/b47b5391-bcca-4bb3-a575-8ece68901d5d/resource/bcbeff5d-14d7-4ec0-a211-4c4d3e96cfba/download/istanbul-ehir-hatlar-iskeleleri.kml') as u:
    raw = parse(u.read().decode('utf8'))
    
raw

OrderedDict([('kml',
              OrderedDict([('@xmlns', 'http://www.opengis.net/kml/2.2'),
                           ('@xmlns:gx', 'http://www.google.com/kml/ext/2.2'),
                           ('@xmlns:kml', 'http://www.opengis.net/kml/2.2'),
                           ('@xmlns:atom', 'http://www.w3.org/2005/Atom'),
                           ('Document',
                            OrderedDict([('name', 'SHI İSKELELER.kml'),
                                         ('StyleMap',
                                          [OrderedDict([('@id',
                                                         'msn_marina23'),
                                                        ('Pair',
                                                         [OrderedDict([('key',
                                                                        'normal'),
                                                                       ('styleUrl',
                                                            

In [30]:
res = []
for x in raw['kml']['Document']['Folder']['Folder']:
    res.extend(x['Placemark'])
    
res

[OrderedDict([('name', 'MALTEPE'),
              ('LookAt',
               OrderedDict([('gx:TimeStamp',
                             OrderedDict([('when', '2020-06-16')])),
                            ('gx:ViewerOptions',
                             OrderedDict([('gx:option',
                                           [OrderedDict([('@name',
                                                          'historicalimagery')]),
                                            OrderedDict([('@enabled', '0'),
                                                         ('@name',
                                                          'sunlight')]),
                                            OrderedDict([('@enabled', '0'),
                                                         ('@name',
                                                          'streetview')])])])),
                            ('longitude', '29.13060758098593'),
                            ('latitude', '40.91681013544846'),
      

In [25]:
names = []
for x in res:
    names.append(x['name'])
names

['MALTEPE',
 'AHIRKAPI',
 'BEŞİKTAŞ-1',
 'BEŞİKTAŞ-2',
 'BOSTANCI',
 'EMİNÖNÜ-1',
 'EMİNÖNÜ-2',
 'EMİNÖNÜ-3',
 'EMİNÖNÜ-4',
 'HAYDARPAŞA',
 'KABATAŞ',
 'KADIKÖY-1',
 'KADIKÖY-2',
 'KARAKÖY',
 'KARAKÖY-2',
 'MODA',
 'ÜSKÜDAR',
 'AYVANSARAY',
 'BALAT',
 'EMİNÖNÜ HALİÇ',
 'EYÜP SULTAN',
 'FENER',
 'HASKÖY',
 'KASIMPAŞA',
 'SÜTLÜCE',
 'BURGAZADA',
 'BÜYÜKADA',
 'HEYBELİADA',
 'KINALIADA',
 'SEDEF ADASI',
 'ANADOLU HİSARI',
 'ANADOLU KAVAĞI',
 'ARNAVUTKÖY',
 'BEBEK',
 'BEYKOZ',
 'BEYLERBEYİ',
 'BÜYÜKDERE',
 'ÇENGELKÖY',
 'ÇUBUKLU',
 'EMİRGAN',
 'ÇUBUKLU ARABALI',
 'İSTİNYE',
 'KANDİLLİ',
 'KANLICA',
 'KUZGUNCUK',
 'KÜÇÜKSU',
 'ORTAKÖY',
 'POYRAZ',
 'PAŞABAHÇE',
 'RUMELİ KAVAĞI',
 'SARIYER',
 'YENİKÖY',
 'ÇUBUKLU ARABALI',
 'İSTİNYE ARABALI']

In [33]:
longitudes = []
latitudes = []
for x in res:
    try:
        lo = x['LookAt']['longitude']
        la = x['LookAt']['latitude']
    except KeyError:
        lo = x['Camera']['longitude']
        la = x['Camera']['latitude']
    longitudes.append(lo)
    latitudes.append(la)

In [34]:
pd.DataFrame({'Name':names, 'Longitudes': longitudes, 'Latitudes': latitudes})

| | Name | Longitudes | Latitudes |
|---|---|---|---|
| 0 | MALTEPE | 29.13060758098593 | 40.91681013544846 |
| 1 | AHIRKAPI | 28.98289668101853 | 41.00314456999032 |
| 2 | BEŞİKTAŞ-1 | 29.00778819900819 | 41.04116198628195 |
| 3 | BEŞİKTAŞ-2 | 29.0055048939288 | 41.04065414312002 |
| 4 | BOSTANCI | 29.09425745312653 | 40.95173395654253 |
| 5 | EMİNÖNÜ-1 | 28.97621869809887 | 41.01495987953694 |
| 6 | EMİNÖNÜ-2 | 28.97621869809887 | 41.01495987953694 |
| 7 | EMİNÖNÜ-3 | 28.97495985342729 | 41.01488637107048 |
| 8 | EMİNÖNÜ-4 | 28.97495985342729 | 41.01488637107048 |
| 9 | HAYDARPAŞA | 29.01810215560077 | 40.99577360085738 |
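
The question asks for station names as the index and two coordinate columns. A minimal sketch of that shape follows, using the first three rows of the output above. `pd.to_numeric` is applied because `xmltodict` returns every value as a string; the variable and column names here are illustrative.

```python
import pandas as pd

# First three stations copied from the output above (values are strings).
names = ["MALTEPE", "AHIRKAPI", "BEŞİKTAŞ-1"]
longitudes = ["29.13060758098593", "28.98289668101853", "29.00778819900819"]
latitudes = ["40.91681013544846", "41.00314456999032", "41.04116198628195"]

# Station names become the index; each coordinate column is converted to float.
stations = pd.DataFrame(
    {"Latitude": latitudes, "Longitude": longitudes},
    index=pd.Index(names, name="Station"),
).apply(pd.to_numeric)
print(stations)
```

With numeric columns, later calculations such as distances or plotting work without further conversion.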


# Question 2

For this question we are going to use the Istanbul Municipality data at [this link](https://data.ibb.gov.tr/dataset/sehir-hatlari-sefer-sayilari). The data contains Istanbul Deniz Isletmeleri route information.

1. Understand what the data is for, what it records, what it contains. Explain it in broad strokes.
2. Extract the data about the number of trips between stations.
3. Calculate the total number of trips in 2020.
4. Calculate the total number of trips in 2021.
5. Which is the busiest station in 2020 and 2021?

For these questions you must extract the specific data you need from the raw data. Let me be very clear: I am not interested in the numerical answer; I'd like to see your data extraction and calculation explicitly. I need to see the code with which you extract the data, the data frame where you record the extracted data, and the code where you group and calculate the required results.

In [143]:
def readIDID(url):
    data = pd.read_csv(url,
                       encoding='iso-8859-9',
                       sep=';',
                       decimal=',',
                       thousands='.')
    
    return data
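
The `sep`, `decimal` and `thousands` options in `readIDID` matter because Turkish number formatting uses a dot for thousands and a comma for decimals. The sketch below illustrates their effect on a hand-written snippet; it assumes the raw file writes 26879 as `26.879`, which is a guess about the source format, and it reads from an in-memory string only so that the example runs offline.

```python
from io import StringIO

import pandas as pd

# Hypothetical excerpt in the assumed raw format (dot as thousands separator).
text = (
    "YIL;GÜZERGAH;TOPLAM SEFER ADETİ\n"
    "2020;BEŞİKTAŞ - KADIKÖY;26.879\n"
    "2020;KADIKÖY - KARAKÖY - BEŞİKTAŞ;13\n"
)

# Same parsing options as readIDID, minus the encoding (the text is already str).
frame = pd.read_csv(StringIO(text), sep=";", decimal=",", thousands=".")
print(frame["TOPLAM SEFER ADETİ"].tolist())
```

Without `thousands='.'`, a value like `26.879` would be read as a fraction rather than as twenty-six thousand.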

In [144]:
data2020 = readIDID('https://data.ibb.gov.tr/dataset/4be0d5f6-62de-4a24-a6a5-038009cde39f/resource/f1f95d5d-fa2f-479d-9d50-85ca1d604c1e/download/2020-yl-ehir-hatlar-sefer-saylar.csv')
data2020

| | YIL | GÜZERGAH | TOPLAM SEFER ADETİ |
|---|---|---|---|
| 0 | 2020 | BEŞİKTAŞ - KADIKÖY | 26879 |
| 1 | 2020 | KADIKÖY - KARAKÖY - BEŞİKTAŞ | 13 |
| 2 | 2020 | EMİNÖNÜ - ÜSKÜDAR | 28441 |
| 3 | 2020 | ÜSKÜDAR - KARAKÖY - EMİNÖNÜ | 8737 |
| 4 | 2020 | KADIKÖY - EMİNÖNÜ | 18408 |
| 5 | 2020 | KADIKÖY - KARAKÖY | 25658 |
| 6 | 2020 | KABATAŞ - KADIKÖY - ADALAR - BOSTANCI | 5879 |
| 7 | 2020 | İSTANBUL - ADALAR | 4542 |
| 8 | 2020 | KADIKÖY - KARAKÖY - EMİNÖNÜ | 11156 |
| 9 | 2020 | BOĞAZ GİDİŞ GELİŞ (EMİNÖNÜ - BEŞİKTAŞ -KUZGUN... | 523 |


In [145]:
data2020['TOPLAM SEFER ADETİ'].sum()

193669

In [122]:
data2021 = readIDID('https://data.ibb.gov.tr/dataset/4be0d5f6-62de-4a24-a6a5-038009cde39f/resource/d2c7e4c3-fd09-4952-8a8e-776e3accf91d/download/2021-yl-ehir-hatlar-sefer-saylar.csv')
data2021['Toplam Sefer Adeti'].sum()

177882.0

In [123]:
data2021

| | Yil | Guzergah | Toplam Sefer Adeti |
|---|---|---|---|
| 0 | 2021.0 | BEŞİKTAŞ-KADIKÖY | 23658.0 |
| 1 | 2021.0 | EMİNÖNÜ-ÜSKÜDAR | 23854.0 |
| 2 | 2021.0 | EMİNÖNÜ-KADIKÖY | 18298.0 |
| 3 | 2021.0 | EMİNÖNÜ-BEŞİKTAŞ-KUZGUNCUK-BEYLERBEYİ-ÇENGELKÖ... | 497.0 |
| 4 | 2021.0 | EMİNÖNÜ-BEŞİKTAŞ-ORTAKÖY-EMİRGAN-PAŞABAHÇE-BEY... | 545.0 |
| ... | ... | ... | ... |
| 70 | NaN | NaN | NaN |
| 71 | NaN | NaN | NaN |
| 72 | NaN | NaN | NaN |
| 73 | NaN | NaN | NaN |
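
The 2021 file differs from the 2020 one in two ways visible in the outputs: its column names are spelled differently, and it ends with empty rows that turn the numeric columns into floats. A hedged sketch of one way to bring both years into a common shape follows. The `tidy` helper and the English column names are illustrative choices, the small frames are hand-built from rows shown in the outputs, and renaming by position assumes both files keep the same column order.

```python
import numpy as np
import pandas as pd

# Miniature stand-ins for the two yearly files, built from displayed rows.
raw2020 = pd.DataFrame(
    {
        "YIL": [2020, 2020],
        "GÜZERGAH": ["BEŞİKTAŞ - KADIKÖY", "EMİNÖNÜ - ÜSKÜDAR"],
        "TOPLAM SEFER ADETİ": [26879, 28441],
    }
)
raw2021 = pd.DataFrame(
    {
        "Yil": [2021.0, 2021.0, np.nan],
        "Guzergah": ["BEŞİKTAŞ-KADIKÖY", "EMİNÖNÜ-ÜSKÜDAR", np.nan],
        "Toplam Sefer Adeti": [23658.0, 23854.0, np.nan],
    }
)


def tidy(frame):
    """Drop fully empty rows, unify column names, restore integer types."""
    out = frame.dropna(how="all").copy()
    # Assumption: year, route and trip count appear in this order in both files.
    out.columns = ["year", "route", "trips"]
    return out.astype({"year": int, "trips": int})


trips = pd.concat([tidy(raw2020), tidy(raw2021)], ignore_index=True)
totals = trips.groupby("year")["trips"].sum()
print(totals)
```

Once the years share one frame, a single `groupby` answers the per-year questions instead of repeating the same code for each file.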


In [150]:
data2020.loc[data2020['TOPLAM SEFER ADETİ'].idxmax(), :]

YIL                                2020
GÜZERGAH              EMİNÖNÜ - ÜSKÜDAR
TOPLAM SEFER ADETİ                28441
Name: 2, dtype: object

In [151]:
data2021.loc[data2021['Toplam Sefer Adeti'].idxmax(), :]

Yil                            2021.0
Guzergah              EMİNÖNÜ-ÜSKÜDAR
Toplam Sefer Adeti            23854.0
Name: 1, dtype: object
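
The two results above identify the busiest route, while the question asks for the busiest station. One possible way to get from routes to stations, sketched below on four routes from the 2020 output, is to split each route name into its stops and credit every stop with the route's trips. That counting rule is an assumption, not something the dataset prescribes, and route names such as `İSTANBUL - ADALAR` that do not name single stations would need separate handling.

```python
import pandas as pd

# Four routes copied from the 2020 output.
routes = pd.DataFrame(
    {
        "GÜZERGAH": [
            "BEŞİKTAŞ - KADIKÖY",
            "EMİNÖNÜ - ÜSKÜDAR",
            "KADIKÖY - EMİNÖNÜ",
            "KADIKÖY - KARAKÖY",
        ],
        "TOPLAM SEFER ADETİ": [26879, 28441, 18408, 25658],
    }
)

# One row per (route, stop): split on the hyphen, then expand the lists.
stops = routes.assign(stop=routes["GÜZERGAH"].str.split("-")).explode(
    "stop", ignore_index=True
)
stops["stop"] = stops["stop"].str.strip()

# Assumption: every trip on a route counts once for each stop on that route.
trips_per_stop = stops.groupby("stop")["TOPLAM SEFER ADETİ"].sum()
print(trips_per_stop.sort_values(ascending=False))
print(trips_per_stop.idxmax())
```

On these four rows alone the top stop is KADIKÖY; the full files may give a different answer.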

In [68]:
23854/(18*365)

3.630745814307458

# Question 3

Using a different [dataset](https://data.ibb.gov.tr/dataset/istanbul-deniz-iskeleleri-yolcu-sayilari), again from Istanbul Municipality, on Istanbul Deniz Isletmeleri:

1. Understand what the data is for, what it records, what it contains. Explain it in broad strokes.
2. Find the busiest station in the years 2020 and 2021.
3. Repeat the same calculation monthly: find the busiest station for each month.
4. Does your calculation of busiest stations agree with the calculation you made in Question #2? Explain.


In [49]:
data = pd.read_csv('https://data.ibb.gov.tr/dataset/20f33ff0-1ab3-4378-9998-486e28242f48/resource/6fbdd928-8c37-43a4-8e6a-ba0fa7f767fb/download/istanbul-deniz-iskeleleri-yolcu-saylar.csv',
                   sep=';',
                   encoding='iso-8859-9')
data

| | Yil | Ay | Otorite Adi | Istasyon Adi | Yolcu Sayisi |
|---|---|---|---|---|---|
| 0 | 2021 | 3 | Beyden Deniz Ulaşım Hizmetleri Turizm ve Tic. ... | BEYKOZ | 5076 |
| 1 | 2021 | 3 | Beyden Deniz Ulaşım Hizmetleri Turizm ve Tic. ... | YENIKOY | 5347 |
| 2 | 2021 | 3 | DENTUR - AVRASYA DENİZ TAŞ.TUR.HİZ.İNŞ.SAN.TİC... | BESIKTAS | 106334 |
| 3 | 2021 | 3 | DENTUR - AVRASYA DENİZ TAŞ.TUR.HİZ.İNŞ.SAN.TİC... | KABATAS | 24 |
| 4 | 2021 | 3 | DENTUR - AVRASYA DENİZ TAŞ.TUR.HİZ.İNŞ.SAN.TİC... | USKUDAR | 94200 |
| ... | ... | ... | ... | ... | ... |
| 656 | 2021 | 11 | TURYOL - S.S. TURİZM VE YOLCU DENİZ TAŞIYICILA... | Eminönü | 55387 |
| 657 | 2021 | 11 | TURYOL - S.S. TURİZM VE YOLCU DENİZ TAŞIYICILA... | Kadıköy Balon | 40680 |
| 658 | 2021 | 11 | TURYOL - S.S. TURİZM VE YOLCU DENİZ TAŞIYICILA... | Kadıköy Çayırbaşı | 69443 |
| 659 | 2021 | 11 | TURYOL - S.S. TURİZM VE YOLCU DENİZ TAŞIYICILA... | Karaköy | 55098 |


In [69]:
data[data['Yil'] == 2021]

| | Yil | Ay | Otorite Adi | Istasyon Adi | Yolcu Sayisi |
|---|---|---|---|---|---|
| 0 | 2021 | 3 | Beyden Deniz Ulaşım Hizmetleri Turizm ve Tic. ... | BEYKOZ | 5076 |
| 1 | 2021 | 3 | Beyden Deniz Ulaşım Hizmetleri Turizm ve Tic. ... | YENIKOY | 5347 |
| 2 | 2021 | 3 | DENTUR - AVRASYA DENİZ TAŞ.TUR.HİZ.İNŞ.SAN.TİC... | BESIKTAS | 106334 |
| 3 | 2021 | 3 | DENTUR - AVRASYA DENİZ TAŞ.TUR.HİZ.İNŞ.SAN.TİC... | KABATAS | 24 |
| 4 | 2021 | 3 | DENTUR - AVRASYA DENİZ TAŞ.TUR.HİZ.İNŞ.SAN.TİC... | USKUDAR | 94200 |
| ... | ... | ... | ... | ... | ... |
| 656 | 2021 | 11 | TURYOL - S.S. TURİZM VE YOLCU DENİZ TAŞIYICILA... | Eminönü | 55387 |
| 657 | 2021 | 11 | TURYOL - S.S. TURİZM VE YOLCU DENİZ TAŞIYICILA... | Kadıköy Balon | 40680 |
| 658 | 2021 | 11 | TURYOL - S.S. TURİZM VE YOLCU DENİZ TAŞIYICILA... | Kadıköy Çayırbaşı | 69443 |
| 659 | 2021 | 11 | TURYOL - S.S. TURİZM VE YOLCU DENİZ TAŞIYICILA... | Karaköy | 55098 |


In [70]:
np.unique(data['Ay'])

array([ 3,  4,  5,  6,  7,  8,  9, 10, 11])

In [110]:
res = data.groupby('Istasyon Adi')[['Yolcu Sayisi']].sum()
res['Yolcu Sayisi'].idxmax()

'USKUDAR'
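
Station names in this file are not spelled in one consistent style: the displayed rows include both upper-case ASCII names such as `USKUDAR` and mixed-case Turkish names such as `Eminönü`. If the same station appears under two spellings, `groupby` treats them as different stations. The sketch below shows one possible normalization, assuming that folding Turkish letters to ASCII and upper-casing is enough to merge the variants. The sample rows, passenger counts and the second spelling `Üsküdar` are made up for illustration.

```python
import pandas as pd

# Map Turkish letters to their closest ASCII counterparts.
TURKISH_TO_ASCII = str.maketrans("çğıöşüÇĞİÖŞÜ", "cgiosuCGIOSU")


def normalize(name):
    """Fold Turkish letters to ASCII, upper-case, and trim whitespace."""
    return name.translate(TURKISH_TO_ASCII).upper().strip()


# Hypothetical rows: the first two are meant to be the same station.
sample = pd.DataFrame(
    {
        "Istasyon Adi": ["USKUDAR", "Üsküdar", "Eminönü"],
        "Yolcu Sayisi": [100, 50, 120],
    }
)
sample["station"] = sample["Istasyon Adi"].map(normalize)
totals = sample.groupby("station")["Yolcu Sayisi"].sum()
print(totals)
```

Whether distinct piers such as `Kadıköy Balon` and `Kadıköy Çayırbaşı` should also be merged is a separate modelling decision that this sketch does not make.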

In [118]:
[month.groupby('Istasyon Adi')['Yolcu Sayisi'].sum().idxmax() for _, month in data.groupby('Ay')]

['USKUDAR',
 'USKUDAR',
 'USKUDAR',
 'USKUDAR',
 'USKUDAR',
 'USKUDAR',
 'USKUDAR',
 'USKUDAR',
 'USKUDAR']
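
The list above loses the month labels. A sketch of an equivalent calculation that keeps them follows, using a grouped total and `idxmax` per month. The rows are a handful copied from the displayed output for months 3 and 11, so the result here is illustrative and will not match the full-data answer.

```python
import pandas as pd

# A few rows from the displayed output (months 3 and 11 only).
sample = pd.DataFrame(
    {
        "Ay": [3, 3, 3, 11, 11, 11],
        "Istasyon Adi": [
            "BEYKOZ", "BESIKTAS", "USKUDAR",
            "Eminönü", "Kadıköy Çayırbaşı", "Karaköy",
        ],
        "Yolcu Sayisi": [5076, 106334, 94200, 55387, 69443, 55098],
    }
)

# Total passengers per (month, station), kept as ordinary columns.
totals = sample.groupby(["Ay", "Istasyon Adi"], as_index=False)["Yolcu Sayisi"].sum()

# For each month, pick the row whose total is largest.
busiest = totals.loc[totals.groupby("Ay")["Yolcu Sayisi"].idxmax()].set_index("Ay")
print(busiest)
```

The result is a data frame indexed by month, so each winner stays attached to its month and its passenger total.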

In [88]:
help(res.idxmax)

Help on method idxmax in module pandas.core.frame:

idxmax(axis: 'Axis' = 0, skipna: 'bool' = True) -> 'Series' method of pandas.core.frame.DataFrame instance
    Return index of first occurrence of maximum over requested axis.
    
    NA/null values are excluded.
    
    Parameters
    ----------
    axis : {0 or 'index', 1 or 'columns'}, default 0
        The axis to use. 0 or 'index' for row-wise, 1 or 'columns' for column-wise.
    skipna : bool, default True
        Exclude NA/null values. If an entire row/column is NA, the result
        will be NA.
    
    Returns
    -------
    Series
        Indexes of maxima along the specified axis.
    
    Raises
    ------
    ValueError
        * If the row/column is empty
    
    See Also
    --------
    Series.idxmax : Return index of the maximum element.
    
    Notes
    -----
    This method is the DataFrame version of ``ndarray.argmax``.
    
    Examples
    --------
    Consider a dataset containing food consumption in Arg

In [90]:
help(np.where)

Help on function where in module numpy:

where(...)
    where(condition, [x, y], /)
    
    Return elements chosen from `x` or `y` depending on `condition`.
    
    .. note::
        When only `condition` is provided, this function is a shorthand for
        ``np.asarray(condition).nonzero()``. Using `nonzero` directly should be
        preferred, as it behaves correctly for subclasses. The rest of this
        documentation covers only the case where all three arguments are
        provided.
    
    Parameters
    ----------
    condition : array_like, bool
        Where True, yield `x`, otherwise yield `y`.
    x, y : array_like
        Values from which to choose. `x`, `y` and `condition` need to be
        broadcastable to some shape.
    
    Returns
    -------
    out : ndarray
        An array with elements from `x` where `condition` is True, and elements
        from `y` elsewhere.
    
    See Also
    --------
    choose
    nonzero : The function that is called when x an