# Some ground rules for the assignments:

For all assignments (this one and any future assignment including the final project): 

* Do not download and save the data locally unless your data is very big (~TBs). I **do not** want to see you opening a local file for the data I gave you as a URL. Anything local is suspect: local files can't be trusted (they might be manipulated, changed, modified, or tampered with). Refer to my lecture notes on how to pull data from a URL using `urlopen`.

* All computations must be done locally within Python. Nothing external: no manual input, no Excel, no SQL, no Java, etc.

* All code has to be explained. Explain your reasoning and your choices. If you installed a third-party library (including `numpy`, `scipy`, `pandas`, etc.), explain which part you import and what that function does.

* Explain your code using a markdown cell. **Do not** use code comments starting with `#` to do your explanations.

* Do not use `if __name__ == "__main__"`. EVER! If you are using that within Jupyter, I am going to assume you found the solution on the internet and cut and pasted it without understanding what that piece of code did.
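
As a minimal sketch of the first rule, assuming a plain-text resource: the helper below reads the bytes returned by `urlopen` straight into memory and never writes a file. The name `fetch_text`, its `opener` parameter and the placeholder URL in the usage note are illustrative assumptions, not something taken from the lecture notes.

```python
from urllib.request import urlopen


def fetch_text(url, encoding="utf-8", opener=urlopen):
    """Return the decoded body of `url` without touching the local disk.

    `opener` is a hypothetical hook that lets the function be tried
    offline; by default it is urllib's own `urlopen`.
    """
    with opener(url) as response:
        # The whole body is held in memory as bytes, then decoded.
        return response.read().decode(encoding)
```

A call would look like `fetch_text("https://example.org/data.csv")`, where the address is only a placeholder.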

# Question 1

Istanbul municipality has an open data service that provides detailed information about its services. For this question, use the data given at [this link](https://data.ibb.gov.tr/dataset/istanbul-sehir-hatlari-iskeleleri).

1. Understand what the data is for. Explain what it is, what it records, what pieces it has in broad strokes. 

2. Data (among other things) contains geographical locations of Sea Stations ('Iskele') of Istanbul Deniz Isletmeleri boats operating in Istanbul.  Extract the locations of these stations as a pandas dataframe with 2 columns: latitudes and longitudes. The index of the dataframe has to be the station names.

The data is in XML format. Do not use any external libraries other than numpy and pandas. Use `xmltodict` to convert it into a dictionary then extract the necessary parts.

## Solution 1.1
The data locates the Municipality's ship docks. It contains their coordinates, so if it is fed into a mapping program (like Google Maps), a person can use it to locate the docks and name them. There is also some additional information, such as whether a dock is available or not. 

## Solution 1.2
Three names have been imported:
* `urlopen` (from `urllib.request`), which enables the program to make an HTTP connection.

* `pandas`, since it has a built-in dataframe which looks neat.

* `xmltodict`, since I chose to parse the XML file into a collection of ordered dictionaries. I could have parsed the XML file into a tree via `xml.etree.ElementTree`, but I realized that I am not comfortable working with trees at this point.
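
For comparison, the tree-based route mentioned in the last bullet might look roughly like the sketch below. It runs on a tiny made-up KML fragment, not on the real file. The KML namespace URI, the station names and the assumption that every placemark carries a `LookAt` element are all illustrative.

```python
import xml.etree.ElementTree as ET

# A made-up fragment; the real file has more nesting and more fields.
SAMPLE_KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Folder>
      <Placemark>
        <name>STATION-A</name>
        <LookAt><longitude>29.0</longitude><latitude>41.0</latitude></LookAt>
      </Placemark>
      <Placemark>
        <name>STATION-B</name>
        <LookAt><longitude>28.9</longitude><latitude>41.1</latitude></LookAt>
      </Placemark>
    </Folder>
  </Document>
</kml>"""

# Assumed namespace mapping; drop the prefix if the file declares none.
NS = {"k": "http://www.opengis.net/kml/2.2"}


def placemark_coordinates(kml_text):
    """Map each Placemark name to a (latitude, longitude) pair of floats."""
    root = ET.fromstring(kml_text)
    result = {}
    for pm in root.iterfind(".//k:Placemark", NS):
        name = pm.findtext("k:name", namespaces=NS)
        view = pm.find("k:LookAt", NS)
        lat = float(view.findtext("k:latitude", namespaces=NS))
        lon = float(view.findtext("k:longitude", namespaces=NS))
        result[name] = (lat, lon)
    return result
```

The search path `.//k:Placemark` finds placemarks at any depth, so the chain of parent folders does not have to be spelled out.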
    

In [1]:
from urllib.request import urlopen
import pandas as ps
import xmltodict

The program opens the link to the *.kml* file with the `urlopen()` function. The *.kml* file is then parsed into a collection of ordered dictionaries with the `xmltodict.parse` function. The data that we are looking for is nested under several parent folders, so we define the variable `tum_iskeleler` to hold the part we need (i.e. all the necessary data is contained in `tum_iskeleler`). We then create empty lists that will later store the station names, longitudes and latitudes. 

In [2]:
with urlopen("https://data.ibb.gov.tr/dataset/b47b5391-bcca-4bb3-a575-8ece68901d5d/resource/bcbeff5d-14d7-4ec0-a211-4c4d3e96cfba/download/istanbul-ehir-hatlar-iskeleleri.kml") as file1:
    data = xmltodict.parse(file1.read())

data = data['kml']['Document']['Folder']['Folder']
tum_iskeleler = data
names_list = []
long_list = []
lati_list = []

A nested `for` loop is used: the first index (`i`) runs through the _"SHI ISKELELER"_ branch and the second (`j`) runs through its child called _"Placemark"_. Inside the loop the program therefore visits all the station names, latitudes and longitudes in turn. The station names are collected in the list `names_list`. A `try`/`except` block is used to capture the latitudes and longitudes because their parent element is not the same for every station: in the given data, some coordinates are stored under a *'LookAt'* element while the others are under *'Camera'*. The latitudes and longitudes are collected in the lists `lati_list` and `long_list`.

In [3]:
for i in range(len(tum_iskeleler)):
    for j in range(len(tum_iskeleler[i]['Placemark'])):
        names_list.append(tum_iskeleler[i]['Placemark'][j]['name'])
        try:
            long_list.append(tum_iskeleler[i]['Placemark'][j]['LookAt']['longitude'])
            lati_list.append(tum_iskeleler[i]['Placemark'][j]['LookAt']['latitude'])
        except (KeyError, TypeError):
            long_list.append(tum_iskeleler[i]['Placemark'][j]['Camera']['longitude'])
            lati_list.append(tum_iskeleler[i]['Placemark'][j]['Camera']['latitude'])
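
The `LookAt`/`Camera` fallback can also be written without exceptions. The sketch below is one possible way, assuming each placemark is a plain dictionary shaped like the `xmltodict` output. The helper name `view_coordinates` and the sample placemarks are hypothetical.

```python
def view_coordinates(placemark):
    """Return (latitude, longitude) from a parsed Placemark dictionary.

    Looks under 'LookAt' first and falls back to 'Camera'. Raises
    KeyError if neither element is present.
    """
    view = placemark.get("LookAt") or placemark.get("Camera")
    if view is None:
        raise KeyError("placemark has neither 'LookAt' nor 'Camera'")
    return view["latitude"], view["longitude"]


# Hypothetical placemarks; xmltodict stores element text as strings.
sample = [
    {"name": "A", "LookAt": {"longitude": "29.0", "latitude": "41.0"}},
    {"name": "B", "Camera": {"longitude": "28.9", "latitude": "41.1"}},
]
pairs = [view_coordinates(p) for p in sample]
```

Because the parsed values are strings, wrapping them in `float()` before building the dataframe would give numeric columns.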

Finally, a table is created with `pandas.DataFrame` whose index consists of the station names and whose two columns are the latitudes and longitudes of the stations. The name of the column axis is set to "Station Name" so that it labels the index in the displayed table.

In [4]:
table = ps.DataFrame({'Latitude': lati_list, 'Longitude':long_list}, names_list)
table.columns.name = "Station Name"
table


Station Name,Latitude,Longitude
MALTEPE,40.91681013544846,29.13060758098593
AHIRKAPI,41.00314456999032,28.98289668101853
BEŞİKTAŞ-1,41.04116198628195,29.00778819900819
BEŞİKTAŞ-2,41.04065414312002,29.0055048939288
BOSTANCI,40.95173395654253,29.09425745312653
EMİNÖNÜ-1,41.01495987953694,28.97621869809887
EMİNÖNÜ-2,41.01495987953694,28.97621869809887
EMİNÖNÜ-3,41.01488637107048,28.97495985342729
EMİNÖNÜ-4,41.01488637107048,28.97495985342729
HAYDARPAŞA,40.99577360085738,29.01810215560077


# Question 2

For this question we are going to use Istanbul Municipality data at [this link](https://data.ibb.gov.tr/dataset/sehir-hatlari-sefer-sayilari). Data contains Istanbul Deniz Isletmeleri route information.

1. Understand what the data is for, what it records, what it contains. Explain it in broad strokes.
2. Extract the data about the number of trips between stations.
3. Calculate the total number of trips in 2020.
4. Calculate the total number of trips in 2021.
5. Which is the busiest station in 2020 and 2021?

For these questions you must extract the specific data you need from the raw data. Let me be very clear: I am not interested in the numerical answer, I'd like to see your data extraction and calculation explicitly. I need to see your code with which you extract the data, see the data frame where you record the extracted data, and the code where you group and calculate the required results.

## Solution 2.1
The data contains 
* all the destinations that one can travel within Istanbul with the Municipality's ship service. 
* the number of trips made to these destinations (annually).
* the year in which the trips were made.

## Solution 2.2
The module `re` (Python's standard regular-expression library) was imported to split semicolon-separated list elements into individual strings. The *url* variables contain the web addresses of the *.csv* files. 
The function `table_maker` performs these tasks in order:
1. It makes an *http* connection, downloads the raw data and reads it via pandas' `read_csv` function.
2. It converts the resulting dataframe into a dictionary. The dictionary's values are turned into a `list`, the first element of that list (itself a dictionary) is taken, and its values are in turn converted into a `list` of strings, one per row.
3. It creates empty lists that will later store the destinations, the number of trips, and the years in which the trips were made.
4. It enters a `for` loop whose index goes from 0 to the length of the list (that is, as many as the number of destinations). It appends each element of the list to an initially empty list called `veriler`. Each string is then split into pieces via the `re.split` function.
5. The split row is then checked for whether it contains any information (as the data set is not very professional, it contains some blank rows). If it does, the information inside (which is split into 3 pieces) is appended to the lists `years`, `destinations` and `trips`.
6. The entries of `trips` represent both fractional and whole numbers. The fractional values are scaled down by 1000 in the original dataset. So, another `for` loop, which browses through the indices of `trips`, detects these fractional values and multiplies them by 1000 to get the accurate value.
7. The number of trips is accumulated in the integer variable `a`.
8. The lists obtained are then fed into `pandas`' `DataFrame()` function, and the display options are set so that the whole dataframe is shown.
9. Finally, the function returns the dataframe (`df20`), the total number of trips (`a`), the `trips` list, and the `destinations` list **as a tuple**. 

In [5]:
import re
url2020 = "https://data.ibb.gov.tr/dataset/4be0d5f6-62de-4a24-a6a5-038009cde39f/resource/f1f95d5d-fa2f-479d-9d50-85ca1d604c1e/download/2020-yl-ehir-hatlar-sefer-saylar.csv"
url2021 = "https://data.ibb.gov.tr/dataset/4be0d5f6-62de-4a24-a6a5-038009cde39f/resource/d2c7e4c3-fd09-4952-8a8e-776e3accf91d/download/2021-yl-ehir-hatlar-sefer-saylar.csv"
def table_maker(url):
    data20 = ps.read_csv(url, encoding='iso-8859-9')
    list20 = data20.to_dict()
    list20 = list(list20.values())
    list20 = list20[0]
    list20 = list(list20.values())
    veriler = []
    trips = []
    destinations = []
    years = []
    a = 0
    for i in range(len(list20)):
        veriler.append(list20[i])
        dummy = veriler[i]
        dummy = re.split(r"\s*[;]\s*", dummy)
        if dummy != ['', '', '']:
            veriler[i] = dummy
            years.append(veriler[i][0])
            destinations.append(veriler[i][1])
            trips.append(veriler[i][2])
    for j in range(len(trips)):
        if float(trips[j]) % 1 != 0:
            trips[j] = int(round(float(trips[j]) * 1000))
        a = a + int(trips[j])

    df20 = ps.DataFrame({'No. Of Trips': trips, 'Year': years}, destinations)
    ps.set_option("display.max_rows", None, "display.max_columns", None)
    return (df20, a, trips, destinations)
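
The row-splitting and rescaling steps of `table_maker` can be isolated as two small helpers, which makes them easy to try on a single string. This is only a sketch: the names `parse_row` and `repair_trip_count` are hypothetical, and the sample strings in the usage note are made up.

```python
import re


def parse_row(raw):
    """Split one 'year;destination;trips' string; return None for blank rows."""
    parts = re.split(r"\s*;\s*", raw.strip())
    if len(parts) != 3 or not any(parts):
        return None
    year, destination, trips = parts
    return year, destination, trips


def repair_trip_count(text):
    """Turn '26.879' into 26879 and leave '13' as 13.

    Assumes, as the solution does, that a fractional value is a count
    that was scaled down by 1000.
    """
    value = float(text)
    if value % 1 != 0:
        # round() protects against products that land just below an integer.
        return int(round(value * 1000))
    return int(value)
```

For example, `parse_row("2020 ; A - B ; 26.879")` gives `("2020", "A - B", "26.879")` and `repair_trip_count("26.879")` gives `26879`. One limitation of this heuristic is that a count that is an exact multiple of 1000, written as `26.000`, would be read as 26.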

The data from 2020 and 2021 are fed into the function in the two cells below. Since we want to see the extracted data, we only display the first element of the output tuple.

In [6]:
table_maker(url2020)[0]

Unnamed: 0,No. Of Trips,Year
BEŞİKTAŞ - KADIKÖY,26879,2020
KADIKÖY - KARAKÖY - BEŞİKTAŞ,13,2020
EMİNÖNÜ - ÜSKÜDAR,28441,2020
ÜSKÜDAR - KARAKÖY - EMİNÖNÜ,8737,2020
KADIKÖY - EMİNÖNÜ,18408,2020
KADIKÖY - KARAKÖY,25658,2020
KABATAŞ - KADIKÖY - ADALAR - BOSTANCI,5879,2020
İSTANBUL - ADALAR,4542,2020
KADIKÖY - KARAKÖY - EMİNÖNÜ,11156,2020
BOĞAZ GİDİŞ GELİŞ (EMİNÖNÜ - BEŞİKTAŞ -KUZGUNCUK - BEYLERBEYİ - ÇENGELKÖY - ARNAVUTKÖY),523,2020


In [7]:
table_maker(url2021)[0]

Unnamed: 0,No. Of Trips,Year
BEŞİKTAŞ-KADIKÖY,23658,2021
EMİNÖNÜ-ÜSKÜDAR,23854,2021
EMİNÖNÜ-KADIKÖY,18298,2021
EMİNÖNÜ-BEŞİKTAŞ-KUZGUNCUK-BEYLERBEYİ-ÇENGELKÖY-ARNAVUTKÖY,497,2021
EMİNÖNÜ-BEŞİKTAŞ-ORTAKÖY-EMİRGAN-PAŞABAHÇE-BEYKOZ,545,2021
ÇENGELKÖY-BEŞİKTAŞ-EMİNÖNÜ,433,2021
KADIKÖY-KARAKÖY,6168,2021
KADIKÖY-KARAKÖY-EMİNÖNÜ,18304,2021
KABATAŞ-KADIKÖY-ADALAR,7046,2021
BOSTANCI- BÜYÜKADA-HEYBELİADA,940,2021


## Solution 2.3
The function returns the total number of trips as one of its outputs.

In [8]:
trips2020 = table_maker(url2020)[1]
print ("The total number of trips made in 2020 is " + str(trips2020)+'.'); 

The total number of trips made in 2020 is 193669.


## Solution 2.4

In [9]:
trips2021 = table_maker(url2021)[1];
print ("The total number of trips made in 2021 is " + str(trips2021)+'.'); 

The total number of trips made in 2021 is 177882.


## Solution 2.5
The function `table_maker` returns a list that contains the number of trips in each destination as an element of its output (namely, the 3rd element of its output tuple), and a list that contains the destinations (namely, the 4th element of its output tuple). So, by calling the 3rd output and comparing its elements to each other via a for loop, we can determine which destination was the busiest. Then we can match that destination's index to the `destinations[]` index to find the name of the destination.

In [10]:
x = 0
y = 0
l_2020 = table_maker(url2020)[2];
for i in range(len(l_2020)):
    l_2020[i] = int(l_2020[i])
    if l_2020[i] > x:
        y = i
        x = l_2020[i]
d_2020 = table_maker(url2020)[3];        
print ("The busiest destination in 2020 was " + d_2020[y] + " with " + str(l_2020[y]) + " trips.")

The busiest destination in 2020 was EMİNÖNÜ - ÜSKÜDAR with 28441 trips.


The same can be done for 2021.

In [11]:
x = 0
y = 0
l_2021 = table_maker(url2021)[2];
for i in range(len(l_2021)):
    l_2021[i] = int(l_2021[i])
    if l_2021[i] > x:
        y = i
        x = l_2021[i]
d_2021 = table_maker(url2021)[3];        
print ("The busiest destination in 2021 was " + d_2021[y] + " with " + str(l_2021[y]) + " trips.")

The busiest destination in 2021 was EMİNÖNÜ-ÜSKÜDAR with 23854 trips.
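
The search for the largest entry can also be condensed with the built-in `max` and a `key` function. The sketch below assumes two parallel lists like the ones returned by `table_maker`. The helper name `busiest` and the sample values are made up.

```python
def busiest(destinations, trips):
    """Return (destination, trips) for the largest trip count.

    On ties the first maximum wins, which matches a loop that only
    replaces the current best on a strictly greater value.
    """
    best = max(range(len(trips)), key=lambda i: int(trips[i]))
    return destinations[best], int(trips[best])


# Made-up values that mix strings and integers, as in the `trips` list.
names = ["A - B", "C - D", "E - F"]
counts = ["120", 340, "75"]
route, n = busiest(names, counts)
```

Here `route` is `"C - D"` and `n` is `340`.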


# Question 3

Using a different [dataset](https://data.ibb.gov.tr/dataset/istanbul-deniz-iskeleleri-yolcu-sayilari) again from Istanbul Municipality on Istanbul Deniz Isletmeleri: 

1. Understand what the data is for, what it records, what it contains. Explain it in broad strokes.
2. Find out the busiest station in the years 2020 and 2021,
3. Repeat the same calculation monthly: find the busiest stations for each month.
4. Does your calculation of busiest stations agree with the calculation you made in Question #2? Explain.


## Solution 3.1
The data contains the names of all of the ship stations in Istanbul, the name of each station's operating company, and the monthly number of passengers who used each station, from the 3rd to the 11th month (both included). 

## Solution 3.2

We first import the data from the dataset's link; since the data is in *.csv* format, we can use `pandas`' `read_csv` function. We then convert the data to a dictionary, then to a list, then we take the first element of that list and lastly convert it to another list. This series of operations gives us the data in a form that we can easily work on.

In [12]:
import re
data3 = ps.read_csv("https://data.ibb.gov.tr/dataset/20f33ff0-1ab3-4378-9998-486e28242f48/resource/6fbdd928-8c37-43a4-8e6a-ba0fa7f767fb/download/istanbul-deniz-iskeleleri-yolcu-saylar.csv", encoding='latin-1')
data3 = data3.to_dict(); data3 =list(data3.values()); data3 = data3[0]; data3 = list(data3.values());

We then split the elements of the list (`data3`) into year (`year`), month (`month`), authority (`auth`), station name (`station`), and passenger count (`pcount`) so that we can analyze the numbers more easily. Keeping these in separate lists is not a problem, since every row (`dummy`) occupies the same index in all of the lists.

In [13]:
year = []; month = []; auth = []; station = []; pcount=[];dummy = ''; comp_data = []; month_indv = []; month_all = [];
for i in range(len(data3)):
    comp_data.append(data3[i]);
    dummy = str(comp_data[i])
    dummy = re.split(r"\s*[;]\s*", dummy)
    year.append(dummy[0]); month.append(dummy[1]); auth.append(dummy[2]);
    station.append(dummy[3]); pcount.append(int(dummy[4]));
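
The same field splitting can be delegated to a parser by telling it that the delimiter is a semicolon. The sketch below uses the standard `csv` module on made-up text. The field order follows the splitting code above, while the sample rows and the helper name `read_rows` are hypothetical.

```python
import csv
import io

# Made-up rows in 'year;month;authority;station;passengers' order.
SAMPLE = "2021;3;AUTH-1;STATION-A;100\n2021;3;AUTH-2;STATION-B;250\n"


def read_rows(text):
    """Parse semicolon-separated text into tuples with an integer count."""
    reader = csv.reader(io.StringIO(text), delimiter=";")
    rows = []
    for year, month, auth, station, pcount in reader:
        rows.append((year, month, auth, station, int(pcount)))
    return rows


rows = read_rows(SAMPLE)
```

`pandas`' `read_csv` accepts a `sep=';'` argument for the same purpose, which would yield one dataframe column per field.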

We then convert the list of station names (`station`) into a set (`st_indv`) to remove duplicates. With the outer `for` loop browsing through the set's elements and the inner `for` loop browsing through the `station` list, we sum the number of passengers who have used each specific station to determine its annual count. We then create a dictionary that pairs the station names with the annual counts, with the keys being the annual counts and the values being the station names. We then use the built-in `max` function to find the maximum of the annual counts, and since that number is a key in the dictionary, we determine the busiest station by `dict[key]` and then print out the results.
(P.S. Uncomment the bottom two lines of the cell below to see the dataframe with station names and annual counts.)

In [14]:
x = ''; y = 0; anst_count = {}; count = 0;
st_indv = set(station)
for x in st_indv:
    count = 0;
    for y in range(len(station)):
        if station[y] == x:
            count = count + int(pcount[y]);
    appender = {count:x};
    anst_count.update(appender);
m = max(anst_count.keys())
print("The busiest station in 2021 was "+anst_count[m]+" with "+str(m)+" passengers.")
#ps.set_option('display.max_rows', None)
#ps.DataFrame({'Annual # of Passengers': anst_count.keys()}, anst_count.values())

The busiest station in 2021 was USKUDAR with 6083839 passengers.
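
A dictionary keyed by passenger count keeps only one station when two stations share the same total, because the later entry overwrites the earlier one. A sketch that keys by station name avoids this. It uses `collections.Counter`; the helper name `annual_totals` and the data are made up for illustration.

```python
from collections import Counter


def annual_totals(stations, counts):
    """Sum the passenger counts per station name."""
    totals = Counter()
    for name, n in zip(stations, counts):
        totals[name] += int(n)
    return totals


# Made-up data in which stations A and B tie at 300 passengers.
stations = ["A", "B", "A", "C"]
counts = [100, 300, 200, 50]
totals = annual_totals(stations, counts)
top_name, top_count = totals.most_common(1)[0]
```

Both tied stations remain in `totals`, so a tie can be detected and reported if needed.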


## Solution 3.3
To find the busiest station in each month, we first need to group the raw data into months. To do that, we create a `for` loop that browses through the items of the set that contains the months (`set(month)`). Inside the main loop, we create an empty dummy list that will hold the monthly information, then run another `for` loop that browses through the original `month` list. Inside this inner loop we check whether the set's current element (`p`) is the same as the current element of the list (`month[w]`), and if that is the case we add the necessary information (station, month and passenger count) from that entry to our dummy list. So, by the end of the first inner loop, we will have created a list (`list_indv`) that contains that month's entries in order and a list that contains only that month's station names (`list_indv2`).

However, the first list contains some entries that have the same station name yet different passenger counts. To merge these entries we create another loop that browses through the list of station names (`list_indv2`) and checks whether a station name occurs more than once. If that is the case, we add the second occurrence's passenger count to the first one and then set the second one's count to zero. So, at the end of the second inner loop we will have a specific month's entries in a list (`list_indv`) in which duplicate entries have been merged. We then append this list to a bigger list (`list_all`) that contains the list of each month.

In [15]:
list_indv =[];list_all=[]; list_indv2 =[]; 
for p in set(month):
    j = 0; list_indv=[];
    list_indv2 =[];
    for w in range(len(month)):
        if month[w] ==p:
            list_indv2.append(station[w]);
            list_indv.append([station[w], month[w], pcount[w]]);
    for j in range(len(list_indv2)):
        if list_indv2.count(list_indv2[j]) !=1:
            try:
                indic = list_indv2.index(list_indv2[j], j+1, len(list_indv2));
                list_indv[j][2] = int(list_indv[j][2]);
                list_indv[j][2] += int(list_indv[indic][2]);
                list_indv[indic][2] = 0; 
            except ValueError:
                pass
    list_all.append(list_indv);

With the list obtained above (`list_all`) we create a loop that browses through its elements (each of which is a month's list) and convert the current element to a dictionary that has the passenger counts as keys and station names as values. We then use the built-in `max` function to find the maximum of the keys (passenger counts). Finally, we print the maximum passenger count and its respective station name.

In [16]:
sozluk ={};
for t in range(len(list_all)):
    sozluk ={};
    temp = list_all[t]
    for k in temp:
        itr = {k[2]:str(k[0])}
        sozluk.update(itr)
    m = max(sozluk.keys())
    print("The busiest station in the " +str(k[1])+". month was " + str(sozluk[m])+ " with " +str(m)+" passengers." )

The busiest station in the 8. month was USKUDAR with 991168 passengers.
The busiest station in the 7. month was USKUDAR with 1025900 passengers.
The busiest station in the 6. month was USKUDAR with 756166 passengers.
The busiest station in the 11. month was USKUDAR with 322826 passengers.
The busiest station in the 9. month was USKUDAR with 957539 passengers.
The busiest station in the 10. month was USKUDAR with 1100606 passengers.
The busiest station in the 3. month was USKUDAR with 161828 passengers.
The busiest station in the 5. month was USKUDAR with 340145 passengers.
The busiest station in the 4. month was USKUDAR with 427549 passengers.
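
The monthly grouping can also be expressed with `pandas`' own `groupby`, which sums duplicate station entries and returns the months in sorted order. The sketch below runs on made-up records. The column names `month`, `station` and `passengers` are illustrative and are not the dataset's own headers.

```python
import pandas as pd

# Made-up records; station A appears twice in month 3.
df = pd.DataFrame(
    {
        "month": [3, 3, 3, 4, 4],
        "station": ["A", "B", "A", "A", "B"],
        "passengers": [10, 25, 20, 5, 40],
    }
)

# Sum duplicate (month, station) rows into one total each.
monthly = df.groupby(["month", "station"])["passengers"].sum()

# For every month, find the (month, station) label of the largest total.
busiest = monthly.groupby(level="month").idxmax()

result = {
    month: (station, int(monthly[(month, station)]))
    for month, station in busiest
}
```

With these sample values `result` maps month 3 to `("A", 30)` and month 4 to `("B", 40)`.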


## Solution 3.4
In Solution 2 we found that the "Eminonu - Uskudar" route was the busiest. In Solution 3.3 we found that the "Uskudar" station hosts the most passengers. The findings agree, but only partially: we could not tell from Solution 2 whether it was Eminonu or Uskudar that caused the busyness. However, Uskudar exceeds the next station in the ranking, Beşiktaş, by 33%, so one can be confident that Uskudar is the source of the busyness.