In this tutorial, you will learn how to store and process structured data.

# How to deal with complicated data processing tasks?

## Never solve problems manually

People who work with data share one principle:

<div class="alert alert-info"><p>   Unless someone is holding a knife to your neck, never match, copy, paste, or perform any other mechanical operation on each piece of data by hand.
    </p></div>
   
    
When we encounter a complicated problem, we try our best to turn it into a programming problem that a program can solve.

## How should data processing tasks be decomposed?

When facing a data processing task, the first and most important question to consider is:

    What are the inputs and outputs of this data processing task?
   
More specifically, what kind of data do we have? How is the data stored? What results are we going to get? What is the form of the resulting data?
<img src="resource/6/2_en.png"  style="width:800px">  

After understanding the input and output, the next step is to divide the task into several small tasks:
<img src="resource/6/3_en.png"  style="width:800px">  
The output of each small task becomes the input of the next one.

## Store data as a data table

In data processing tasks, the most common approach is to store data as a data table, that is, as structured data.

Data stored as tables can be processed easily and efficiently with SQL, Python's pandas, and similar tools.

Another advantage of data tables is that each added column can carry one more dimension of information.

In other words, if there are multiple tables with a similar structure, you can concatenate them into one table and add a column indicating which table each row comes from.
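As a minimal sketch of this idea, the snippet below stacks two small tables with `pd.concat` and records their origin in an extra column. The table names, the column names (`station`, `passengers`, `source`), and the values are made up for illustration.

```python
import pandas as pd

# Two hypothetical tables with the same columns (made-up values).
weekday = pd.DataFrame({"station": [1, 2], "passengers": [120, 80]})
weekend = pd.DataFrame({"station": [1, 2], "passengers": [60, 45]})

# Tag each table with its origin, then stack them into one table.
weekday["source"] = "weekday"
weekend["source"] = "weekend"
combined = pd.concat([weekday, weekend], ignore_index=True)
print(combined)
```

The `source` column plays the role of the added dimension: one table now holds what previously needed two.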

# Example

## Data processing task

For example, suppose we have a batch of subway travel data in the following form:

<img src="resource/6/1.jpg" style="width:800px">




Here, each row represents one passenger's trip, and each number represents a different subway station. A trip passes through a series of subway stations.

From this metro travel data, our data processing task is:

>Extract the transfer volume of each metro transfer station

That is, if metro station 1 is served by line 1, line 2, and line 3, we want the interchange flow between these three lines at station 1.

How can we calculate the transfer volume of every transfer station in the city?

## A possible solution

### Input
The input is the data shown above.

However, this input is not well-structured data: the rows have different lengths, and each row contains several trajectory points.

The input we would like is a well-structured table, like this:

<img src="resource/6/5_en.png" style="width:300px">

The advantage of a data table is that we can add a column to store a new dimension of information.

We don't need to save a separate table for each person's travel trajectory. Instead, we add a trip number column to distinguish different trips.
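One way to do this reshaping in pandas is sketched below. The raw trips and the column names (`trip_id`, `station`, `order`) are assumptions for illustration; the real table layout is the one shown in the figure above.

```python
import pandas as pd

# Hypothetical raw trips: each inner list is one trip, each number a station ID.
raw_trips = [
    [1, 2, 3, 7],
    [5, 3, 2],
    [4, 3, 7, 8, 9],
]

# Start with one row per trip, then explode each station list
# into one row per visited station.
trips = pd.DataFrame({"trip_id": range(len(raw_trips)), "station": raw_trips})
table1 = trips.explode("station", ignore_index=True)
table1["station"] = table1["station"].astype(int)

# Position of each station within its trip, so the visiting order is kept.
table1["order"] = table1.groupby("trip_id").cumcount()
print(table1)
```

Rows of different lengths become a fixed set of columns, and `trip_id` keeps track of which trip each row belongs to.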


### Output
What is the output? Simply the transfer volume in every direction at every transfer station in the city.

So how should we design the output table to hold this information?

It needs to record at which station, from which line, to which line, and how many people transferred.

<img src="resource/6/4_en.png"  style="width:300px">  


### Decompose the task

After understanding the input and output, the task can be decomposed into the following four tasks:

>Task 1: Reorganize the original input into well-structured data to get Table 1  
Task 2: Generate Table 2 from Table 1, keeping the records in order, to see which metro lines each trip travels through  
Task 3: Generate Table 3 from Table 2 by collecting the transfer information, to see from which line to which line, and at which station, each trip transfers  
Task 4: Generate Table 4 from Table 3 by aggregating the transfers to count the volume of each one

<img src="resource/6/6_en.png"  style="width:1000px">  
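Tasks 2 to 4 could be sketched in pandas roughly as follows. This is only one possible approach under stated assumptions: the column names, the toy trips, and the `line_edges` lookup (which line connects each pair of adjacent stations) are all hypothetical and not part of the original data.

```python
import pandas as pd

# Table 1 (assumed layout): one row per station visit, in travel order.
table1 = pd.DataFrame({
    "trip_id": [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    "station": [1, 2, 3, 7, 5, 3, 2, 1, 2, 3, 2, 3, 7],
})

# Hypothetical lookup: which line connects each pair of adjacent stations.
line_edges = pd.DataFrame({
    "station": [1, 2, 3, 5],
    "next_station": [2, 3, 7, 3],
    "line": ["Line 1", "Line 1", "Line 2", "Line 2"],
})
# Add the reverse direction so trips can run either way along a line.
reverse = line_edges.rename(
    columns={"station": "next_station", "next_station": "station"}
)
edges = pd.concat([line_edges, reverse], ignore_index=True)

# Task 2: pair each station with the next one in the same trip,
# then look up the line used on that segment.
segments = table1.assign(
    next_station=table1.groupby("trip_id")["station"].shift(-1)
)
segments = segments.dropna(subset=["next_station"]).astype({"next_station": int})
table2 = segments.merge(edges, on=["station", "next_station"], how="left")

# Task 3: a transfer happens where the line differs from the previous segment's
# line; the transfer station is the start of the segment on the new line.
table2["prev_line"] = table2.groupby("trip_id")["line"].shift(1)
is_transfer = table2["prev_line"].notna() & (table2["prev_line"] != table2["line"])
table3 = table2.loc[is_transfer, ["trip_id", "station", "prev_line", "line"]].rename(
    columns={"prev_line": "from_line", "line": "to_line"}
)

# Task 4: count the transfers per station and direction.
table4 = (
    table3.groupby(["station", "from_line", "to_line"])
    .size()
    .reset_index(name="count")
)
print(table4)
```

Each intermediate table is the input of the next step, so every step can be inspected and checked on its own.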

# Homework

## Dynamic repeatability of bus lines

For a bus line with $n$ bus stations, there are $C^2_n=\frac{n(n-1)}{2}$ station pairs. 
If most of the station pairs of this bus line are also station pairs of other lines, then the line is a huge waste of public transit resources, since its service can be replaced by the other bus lines. Here, we define the **dynamic repeatability of bus lines** as:
$$D_r=\frac{\sum_{i<j}{r_{ij}}}{\frac{n(n-1)}{2}}$$  
where $r_{ij}= 1$ if another bus line connects stations $i$ and $j$, and $r_{ij}= 0$ if no other bus line connects stations $i$ and $j$.
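To make the definition concrete, here is a small sketch on a made-up network. It reflects one possible reading of the formula, not a reference solution: a pair $(i, j)$ counts as connected by another line only if that line serves $i$ before $j$, and no line visits the same station twice. The line names and stations are hypothetical.

```python
from itertools import combinations

# Hypothetical toy network: line name -> stations in travel order.
lines = {
    "A": ["s1", "s2", "s3", "s4"],
    "B": ["s2", "s3", "s4", "s5"],
    "C": ["s6", "s7"],
}

def station_pairs(stations):
    """All pairs (i, j) with i served before j: n(n-1)/2 pairs for n stations."""
    return set(combinations(stations, 2))

def dynamic_repeatability(name, lines):
    """D_r of one line: share of its station pairs also served by another line."""
    own = station_pairs(lines[name])
    others = set()
    for other_name, stations in lines.items():
        if other_name != name:
            others |= station_pairs(stations)
    return len(own & others) / len(own)

for name in lines:
    print(name, dynamic_repeatability(name, lines))
```

Line A shares the pairs (s2, s3), (s2, s4), and (s3, s4) with line B, which is 3 of its 6 pairs, so its $D_r$ is 0.5. Line C shares none, so its $D_r$ is 0.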

```python
# Read data
import pandas as pd

with open(r'data-sample/busline.csv') as f:
    busline = pd.read_csv(f)
busline.head(5)
```

| Unnamed: 0 | linename | stationname | stationgeo |
|---|---|---|---|
| 0 | 650路(锦程文丰公交场站-凤凰山脚) | ['锦程文丰公交场站', '明士达公司', '鸿桥工业园西', '鸿桥工业园', '三洋部件... | ['113.78947398104066,22.728276040706554', '113... |
| 1 | 650路(凤凰山脚-锦程文丰公交场站) | ['凤凰山脚', '凤凰第二工业区', '凤凰台湾街', '凤凰社区', '凤凰广场', '... | ['113.8559999718322,22.68862002580157', '113.8... |
| 2 | m502a线(龙西公交总站-龙西公交总站) | ['龙西公交总站', '添利工业园', '瓦窑坑', '五联社区', '崇和学校', '美信... | ['114.25368997182035,22.759529012563924', '114... |
| 3 | 高快巴士39路(华富路②-坂田风门坳总站) | ['华富路②', '宏杨学校', '坂田地铁站', '扬马市场', '金洲嘉丽园', '坂田... | ['114.08788496287927,22.548925018156375', '114... |
| 4 | 高快巴士39路(坂田风门坳总站-华富路②) | ['坂田风门坳总站', '岗头市场', '华为基地', '华为单身公寓北', '万科城', ... | ['114.0748459685757,22.675367010065482', '114.... |


```python
len(busline)
```

    2113