# Introduction to Data Analysis with Python


<img src="https://www.python.org/static/img/python-logo.png" alt="yogen" style="width: 200px; float: right;"/>
<br>


# Objectives

* Handle tabular data with `pandas`

# The Python scientific stack: SciPy



Python main data libraries:

* NumPy: Base N-dimensional array package
* SciPy library: Fundamental library for scientific computing
* Matplotlib: Comprehensive 2D plotting
* IPython: Enhanced interactive console
* SymPy: Symbolic mathematics
* pandas: Data structures & analysis

## Saving our work to Google Drive

Colab is a very convenient environment, but an ephemeral one. Let's first configure our notebook to save our work to Google Drive, where it can be persisted.

In [1]:
import os
drive_loc = '/content/gdrive'
files_loc = os.path.join(drive_loc, 'MyDrive', 'pdsfiles')

from google.colab import drive
drive.mount(drive_loc)

Mounted at /content/gdrive


Let's create a directory to hold all our work and make sure it's there:

In [2]:
!mkdir -p {files_loc}
!ls {files_loc}

207831-0-accidentes-trafico.xls    example.db_2
207831-0-accidentes-trafico.xls.1  excel_output_2.xls
207831-0-accidentes-trafico.xls.2  excel_output.xls
207831-0-accidentes-trafico.xls.3  out.csv_2
207831-0-accidentes-trafico.xls.4  s3.pkl
207831-0-accidentes-trafico.xls.5  T100_AIRLINES.csv
csv_output.xls			   T100I_SEGMENT_ALL_CARRIER.csv
df2.pkl				   uk_data.csv
example.db


## `pandas`

Distinct set of requirements that pandas introduces:

- Data structures with labeled axes supporting automatic or explicit data alignment. This prevents common errors resulting from misaligned data and working with differently-indexed data coming from different sources.

- Integrated time series functionality.

- The same data structures handle both time series data and non-time series data.

- Arithmetic operations and reductions (like summing across an axis) pass on the metadata (axis labels).

- Flexible handling of missing data.

- Merge and other relational operations found in popular databases (SQL-based, for example).

### Getting started with pandas

First, always remember to check the [API Reference](https://pandas.pydata.org/pandas-docs/stable/reference/index.html).

![Read the effing manual](https://i.kym-cdn.com/photos/images/newsfeed/000/017/668/Mao_RTFM_vectorize_by_cmenghi.png?1318992465)

Convention: when you see `pd`, it means the imported `Pandas` module. We'll be combining it with Numpy as well:

In [3]:
import pandas as pd
import numpy as np

Pandas is evolving. If you look around on the Internet or in articles and books, you may find material that goes back to Pandas 0.21. In our case, the runtime that Colab provides gives us:

In [8]:
pd.__version__

'1.1.5'

...which is quite updated.

Pandas Data Structures are **Series** and **Dataframes**. While they are not a universal solution, they provide a solid, easy-to-use foundation for the data-wrangling tooling of the Data Science world.

### Series

A Series is a one-dimensional array-like object containing
- an array of **data** (of any NumPy data type)
- an associated array of data labels, called its **index**.

It is the base pandas abstraction. You can think of it as the love child of a numpy array and a dictionary, sort of-ish.

The simplest Series is formed from only an array of data:

In [9]:
s = pd.Series([4, 7, -5, 3])
type(s)

pandas.core.series.Series

In [10]:
s

0    4
1    7
2   -5
3    3
dtype: int64

The string representation of a Series displayed interactively shows the index on the left and the values on the right. Since we did not specify an index for the data, a default one consisting of the integers 0 through N - 1 (where N is the length of the data) is created.

If we provide an index, pandas will use it. If not, it will automatically create one.

In [11]:
s.index

RangeIndex(start=0, stop=4, step=1)

Yep, that was Pandas [optimizing things](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.RangeIndex.html#:~:text=RangeIndex%20is%20a%20memory%2Dsaving,is%20provided%20by%20the%20user.). Same thing, we can ask our Series for just the values:

In [12]:
s.values

array([ 4,  7, -5,  3])

Just to be sure, I told you this was a Numpy array. It is:

In [13]:
type(s.values)

numpy.ndarray

Often it will be desirable to create a Series with an index identifying each data point:

In [14]:
s2 = pd.Series([1, 2, 4.5, 7, 2, 23, 15], index=list('javierc'))
s2

j     1.0
a     2.0
v     4.5
i     7.0
e     2.0
r    23.0
c    15.0
dtype: float64

We were quite lazy back there providing the list, weren't we? 

Compared with a regular NumPy array, you can use values in the index when selecting single values or a set of values:

In [15]:
s2['r']

23.0

Now, that's convenient.

NumPy array operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, that you now know better than me, will preserve the index-value link:

In [16]:
s2 % 2 == 0

j    False
a     True
v    False
i    False
e     True
r    False
c    False
dtype: bool

We can similarly apply boolean selection operations to the Series: 

In [17]:
s2[s2 % 2 == 0]

a    2.0
e    2.0
dtype: float64

...and go beyond pure logical operations, of course:

In [18]:
s2 * 2

j     2.0
a     4.0
v     9.0
i    14.0
e     4.0
r    46.0
c    30.0
dtype: float64

In [19]:
np.exp(s2)

j    2.718282e+00
a    7.389056e+00
v    9.001713e+01
i    1.096633e+03
e    7.389056e+00
r    9.744803e+09
c    3.269017e+06
dtype: float64

Note that these operations are not being applied on the original object, but rather in a copy that's being returned:

In [20]:
s2

j     1.0
a     2.0
v     4.5
i     7.0
e     2.0
r    23.0
c    15.0
dtype: float64

Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values (remember that a plain Python dictionary only guarantees insertion order since Python 3.7, and that it can be extended at will). It can be substituted into many functions that expect a dict:

In [21]:
'i' in s2

True
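To make the dict analogy a bit more concrete, here is a minimal sketch on a throwaway Series. The name `prices` and its data are made up for illustration and are not part of the notebook's running example; the sketch relies only on `in`, `Series.get` and `Series.to_dict`:

```python
import pandas as pd

# Hypothetical example data, not part of the notebook's running example
prices = pd.Series([1.5, 0.8, 2.0], index=['apple', 'pear', 'kiwi'])

# Membership tests look at the index (the "keys"), not at the values
has_pear = 'pear' in prices
has_value = 0.8 in prices  # 0.8 is a value, not a label

# .get() returns a default instead of raising KeyError, like dict.get
missing = prices.get('banana', 0.0)

# Round trip back to a plain dict
as_dict = prices.to_dict()

print(has_pear, has_value, missing)
print(as_dict)
```

Note that `in` checks labels only, which is exactly how a dict behaves with its keys.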

We can create Series from dictionaries:

In [22]:
sdata = {
    'Zaragoza' : 2.5e5,
    'Sevilla': 5e5,
    'Cordoba': 3e5,
    'Madrid': 6e6}

s3 = pd.Series(sdata)
s3

Zaragoza     250000.0
Sevilla      500000.0
Cordoba      300000.0
Madrid      6000000.0
dtype: float64

We'll be using s3 later on, so because of the ephemeral nature of Colab, we'll store our Series in our Google Drive instead of redoing all the steps.

We'll also use a handy Colab feature, [Colab Forms](https://colab.research.google.com/notebooks/forms.ipynb), to enable or disable the conditional execution of a code cell based on the UI input exposed by the form:

In [23]:
#@title Saving s3 to Drive
import os
save_to_drive = True #@param {type:"boolean"}

if save_to_drive and files_loc:
  s3.to_pickle(os.path.join(files_loc,"s3.pkl"))
else:
  print('Please, mount Google Drive running the cell at the beginning and try again')

Let's check that our pickle file is there:

In [24]:
!ls {files_loc}/s3.pkl

/content/gdrive/MyDrive/pdsfiles/s3.pkl


When only passing a dict, the index in the resulting Series will **not** have the dict’s keys in sorted order:

In [25]:
s3

Zaragoza     250000.0
Sevilla      500000.0
Cordoba      300000.0
Madrid      6000000.0
dtype: float64

If you want the keys in sorted order, you need to explicitly define it with `sort_index()`:

In [26]:
s3 = pd.Series(sdata).sort_index()
s3

Cordoba      300000.0
Madrid      6000000.0
Sevilla      500000.0
Zaragoza     250000.0
dtype: float64

You can control the ordering as well by explicitly defining the index order using a list:

In [27]:
cities = ['Cordoba', 'Madrid', 'Valencia', 'Zaragoza']
s4 = pd.Series(sdata, index=cities)
s4

Cordoba      300000.0
Madrid      6000000.0
Valencia          NaN
Zaragoza     250000.0
dtype: float64

In this case, the 3 values found in `sdata` for the requested cities were placed in the appropriate locations ('Sevilla' is left out because it is not in `cities`), but since no value for 'Valencia' was found, it appears as `NaN` (not a number), which pandas uses to mark missing or NA values. We will use the terms “missing” or “NA” to refer to missing data. The `isnull` and `notnull` functions in pandas should be used to detect missing data:

In [28]:
pd.isnull(s4)

Cordoba     False
Madrid      False
Valencia     True
Zaragoza    False
dtype: bool

In [29]:
pd.notnull(s4)

Cordoba      True
Madrid       True
Valencia    False
Zaragoza     True
dtype: bool

These can also be consumed as instance methods:

In [30]:
s4.notnull()

Cordoba      True
Madrid       True
Valencia    False
Zaragoza     True
dtype: bool

A critical `Series` feature for many applications is that it automatically aligns differently-indexed data in arithmetic operations:

In [31]:
s3 + s4

Cordoba       600000.0
Madrid      12000000.0
Sevilla            NaN
Valencia           NaN
Zaragoza      500000.0
dtype: float64
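If the `NaN` values produced by alignment are not what you want, one option is `Series.add` with `fill_value`. The sketch below uses two made-up Series, `left` and `right` (hypothetical names and figures that only mirror the shape of `s3` and `s4`):

```python
import pandas as pd

# Hypothetical data mirroring the shape of s3 and s4 above
left = pd.Series({'Cordoba': 3e5, 'Madrid': 6e6, 'Sevilla': 5e5})
right = pd.Series({'Cordoba': 3e5, 'Madrid': 6e6, 'Valencia': 8e5})

# Plain "+" yields NaN wherever a label is missing on either side
plain = left + right

# .add() with fill_value treats a label missing on one side as 0
filled = left.add(right, fill_value=0)

print(plain)
print(filled)
```

Whether filling with 0 is appropriate depends on the data: here it silently treats an unknown population as zero, which may or may not be what you mean.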

Both the Series object itself and its index have a name attribute, which integrates with other key areas of pandas functionality:

In [32]:
s4.name = 'Population'
s4.index.name = 'Province'
s4

Province
Cordoba      300000.0
Madrid      6000000.0
Valencia          NaN
Zaragoza     250000.0
Name: Population, dtype: float64

In [33]:
s3

Cordoba      300000.0
Madrid      6000000.0
Sevilla      500000.0
Zaragoza     250000.0
dtype: float64

In [34]:
s3[0]

300000.0

From what we've seen so far, it may look like the Series object is basically interchangeable with a one-dimensional NumPy array.

The essential difference is the presence of the index: while the Numpy Array has an implicitly defined integer index used to access the values, the Pandas Series has an explicitly defined index associated with the values.
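A small sketch of that difference, using made-up data (`arr` and `ser` are illustrative names): after filtering, the NumPy array is simply renumbered from 0, while the Series keeps the labels of the surviving elements.

```python
import numpy as np
import pandas as pd

arr = np.array([10, 20, 30, 40])
ser = pd.Series(arr, index=['w', 'x', 'y', 'z'])

# The array only knows positions: after filtering, 20 sits at position 0
arr_big = arr[arr > 15]

# The Series keeps its explicit labels: 20 is still reachable as 'x'
ser_big = ser[ser > 15]

print(arr_big[0])
print(ser_big['x'])
print(ser_big.index.tolist())
```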

### DataFrame

A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a dict of Series, all sharing the same index.

Row-oriented and column-oriented operations in DataFrame are treated roughly symmetrically. Under the hood, the data is stored as one or more two-dimensional blocks rather than a list, dict, or some other collection of one-dimensional arrays.

This is the object you'll work most of the time with. It represents a table of _m_ observations x _n_ variables. Each variable, or column, is a Series.

There are numerous ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists or NumPy arrays.

Let's discuss the naming of axes in Pandas, both for series and dataframes.

For Series, because it is a one-dimensional array of values, we've got only **Axis 0**:

![Series axes](https://drive.google.com/uc?export=view&id=1unXqkxjezJiaocs1D-OM2vdj9ad8OLbs)



The dataframe, as we've just seen, is a two-dimensional structure. It has columns and rows, with each column being a separate Series object. The axes in the dataframe are as follows; unless explicitly specified, axis 0 is always the default:

![Dataframe axes](https://drive.google.com/uc?export=view&id=1f8jKqZTURUoM5wV1yz_Ax9PrXSr-uITy)

Ok, enough theory. Let's go for some practice, defining a dataframe just as we said is quite common:

In [35]:
dfdata = {
    'province' : ['M', 'M', 'M', 'B', 'B'],
    'population': [1.5e6, 2e6, 3e6, 5e5, 1.5e6],
    'year' : [1900, 1950, 2000, 1900, 2000]   
}


The resulting DataFrame will have its index assigned automatically as with Series, and the columns (axis 1) keep the order of the keys in the dict:

In [36]:
df = pd.DataFrame(dfdata)
df

Unnamed: 0,province,population,year
0,M,1500000.0,1900
1,M,2000000.0,1950
2,M,3000000.0,2000
3,B,500000.0,1900
4,B,1500000.0,2000


If you specify a sequence of columns, the DataFrame’s columns will be exactly what you pass, and as with Series, if you pass a column that isn’t contained in data, it will appear with NA values in the result:

In [57]:
df2 = pd.DataFrame(dfdata, columns=['province','population', 'year', 'debt'])
df2

Unnamed: 0,province,population,year,debt
0,M,1500000.0,1900,
1,M,2000000.0,1950,
2,M,3000000.0,2000,
3,B,500000.0,1900,
4,B,1500000.0,2000,


If we check the nature of the index in the dataframe, we can see we have the same as in Series:

In [58]:
df2.index

RangeIndex(start=0, stop=5, step=1)

There's a new property we can access called `columns`, where we can see we've got just another index:

In [59]:
df2.columns

Index(['province', 'population', 'year', 'debt'], dtype='object')

In [60]:
type(df2.columns)

pandas.core.indexes.base.Index

A column in a DataFrame can be retrieved as a Series either by dict-like notation or by attribute:

In [61]:
df2['population']

0    1500000.0
1    2000000.0
2    3000000.0
3     500000.0
4    1500000.0
Name: population, dtype: float64

In [62]:
df2.population

0    1500000.0
1    2000000.0
2    3000000.0
3     500000.0
4    1500000.0
Name: population, dtype: float64

Note that the returned Series have the same index as the DataFrame, and their name attribute has been appropriately set.

You can see that what's being returned is in fact a `Series` object, although by now that should be quite clear:

In [63]:
type(df2.population)

pandas.core.series.Series

We can retrieve a whole row or a single cell by position with `iloc`. And, using the dict-like notation, we can add more columns, which will be indexed following the criteria already defined by the dataframe:

In [64]:
df2.iloc[0]

province            M
population    1.5e+06
year             1900
debt              NaN
Name: 0, dtype: object

In [45]:
df2.iloc[0,1]

1500000.0

In [46]:
df2['2nd_language'] = np.nan

Assigning a scalar, as we just did with `np.nan`, broadcasts it to every row of the new column:

In [47]:
df2['2nd_language']

0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
Name: 2nd_language, dtype: float64

But watch out with the naming used, you may hit some of the Python syntax constraints:

In [48]:
df2.2nd_language

SyntaxError: ignored

As with Series, we can replace the default index of the dataframe with our own labels:

In [65]:
df2.index = list('abcde')

In [66]:
df2

Unnamed: 0,province,population,year,debt
a,M,1500000.0,1900,
b,M,2000000.0,1950,
c,M,3000000.0,2000,
d,B,500000.0,1900,
e,B,1500000.0,2000,


In [67]:
df2.province.a

'M'

In [81]:
df2[1:2]

Unnamed: 0,province,population,year,debt
b,M,2000000.0,1950,


This is not a RangeIndex anymore, but a regular Index:

In [76]:
df2.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

We can access a particular row of the Dataframe using the property `loc`:

In [77]:
df2.loc['c']

province          M
population    3e+06
year           2000
debt            NaN
Name: c, dtype: object

In [78]:
type(df2.loc['c'])

pandas.core.series.Series

`loc` admits a list or array of labels:

In [82]:
df2.loc[['a', 'b']]

Unnamed: 0,province,population,year,debt
a,M,1500000.0,1900,
b,M,2000000.0,1950,


If we pass a list with the specific row instead of the label as is, we get the nice Dataframe formatting instead of the Series:

In [83]:
df2.loc[['c']]

Unnamed: 0,province,population,year,debt
c,M,3000000.0,2000,


In [84]:
type(df2.loc[['c']])

pandas.core.frame.DataFrame

We can also pass it a slice of labels:

In [85]:
df2.loc['a':'c']

Unnamed: 0,province,population,year,debt
a,M,1500000.0,1900,
b,M,2000000.0,1950,
c,M,3000000.0,2000,


We can even use a boolean condition for matching rows:

In [86]:
df2.loc[df2['year'] > 1950]

Unnamed: 0,province,population,year,debt
c,M,3000000.0,2000,
e,B,1500000.0,2000,
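`loc` also accepts a callable, which receives the dataframe and must return a valid indexer such as a boolean mask. The sketch below uses a small stand-in dataframe (`frame` is a hypothetical name, with made-up rows) to show that both forms select the same rows:

```python
import pandas as pd

# Small stand-in dataframe, not the notebook's df2
frame = pd.DataFrame(
    {'province': ['M', 'M', 'B'], 'year': [1900, 1950, 2000]},
    index=list('abc'),
)

# Boolean mask built beforehand
by_mask = frame.loc[frame['year'] > 1900]

# Callable evaluated by loc against the dataframe itself
by_callable = frame.loc[lambda d: d['year'] > 1900]

# Rows and columns can be selected at once: rows by callable, columns by list
years_only = frame.loc[lambda d: d['province'] == 'M', ['year']]

print(by_callable)
print(years_only)
```

The callable form is mostly useful in method chains, where the intermediate dataframe has no name to refer to.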


When assigning lists or arrays to a column, the value’s length must match the length of the DataFrame. If you assign a Series, it will be instead conformed exactly to the DataFrame’s index, inserting missing values in any holes:

In [87]:
val = pd.Series([0.1, 0.6, 0.9], index=['b','d','e'])
val

b    0.1
d    0.6
e    0.9
dtype: float64

So let's define values for a particular label or column by passing them as a list:

In [88]:
df2['debt'] = [1,0,2,.5,.7]
df2

Unnamed: 0,province,population,year,debt
a,M,1500000.0,1900,1.0
b,M,2000000.0,1950,0.0
c,M,3000000.0,2000,2.0
d,B,500000.0,1900,0.5
e,B,1500000.0,2000,0.7


In [89]:
df2['debt'] = val
df2

Unnamed: 0,province,population,year,debt
a,M,1500000.0,1900,
b,M,2000000.0,1950,0.1
c,M,3000000.0,2000,
d,B,500000.0,1900,0.6
e,B,1500000.0,2000,0.9


Assigning a column that doesn’t exist will create a new column:

In [90]:
df2['capital'] = df2['province'] == 'M'
df2

Unnamed: 0,province,population,year,debt,capital
a,M,1500000.0,1900,,True
b,M,2000000.0,1950,0.1,True
c,M,3000000.0,2000,,True
d,B,500000.0,1900,0.6,False
e,B,1500000.0,2000,0.9,False


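The `del` keyword will delete columns as with a dict. As with a dict, deleting a key that does not exist raises a `KeyError`, which is what happens here because `df2` was re-created above without the `2nd_language` column: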

In [91]:
del df2['2nd_language']
df2

KeyError: ignored

In [92]:
df2['2nd_language'] = np.nan

You can always transpose the Dataframe and it will switch the indexes in the corresponding axes:

In [93]:
df2.T

Unnamed: 0,a,b,c,d,e
province,M,M,M,B,B
population,1.5e+06,2e+06,3e+06,500000,1.5e+06
year,1900,1950,2000,1900,2000
debt,,0.1,,0.6,0.9
capital,True,True,True,False,False
2nd_language,,,,,


The method `describe` computes a set of summary statistics for a Series or for each DataFrame column:

In [94]:
df2.describe()

Unnamed: 0,population,year,debt,2nd_language
count,5.0,5.0,3.0,0.0
mean,1700000.0,1950.0,0.533333,
std,908295.1,50.0,0.404145,
min,500000.0,1900.0,0.1,
25%,1500000.0,1900.0,0.35,
50%,1500000.0,1950.0,0.6,
75%,2000000.0,2000.0,0.75,
max,3000000.0,2000.0,0.9,


Of course, we can transpose this as well:

In [95]:
df2.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
population,5.0,1700000.0,908295.106229,500000.0,1500000.0,1500000.0,2000000.0,3000000.0
year,5.0,1950.0,50.0,1900.0,1900.0,1950.0,2000.0,2000.0
debt,3.0,0.5333333,0.404145,0.1,0.35,0.6,0.75,0.9
2nd_language,0.0,,,,,,,


One simple way of counting/finding non nulls is to just apply the `count()` method on our dataframe:

In [96]:
df2.count() 

province        5
population      5
year            5
debt            3
capital         5
2nd_language    0
dtype: int64

### Index objects

Pandas’s Index objects are responsible for holding the axis labels and other metadata (like the axis name or names). Any array or other sequence of labels used when constructing a Series or DataFrame is internally converted to an Index. Index objects are immutable, so their elements can't be modified:

In [97]:
df2.index[1] = 'x'

TypeError: ignored

In [98]:
df2.index[1]

'b'

Immutability is important so that Index objects can be safely shared amongst data structures:

In [99]:
series_temp = pd.Series([1.5, -2.5, 0, 1, 2],df2.index)
series_temp

a    1.5
b   -2.5
c    0.0
d    1.0
e    2.0
dtype: float64
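Since an Index cannot be edited in place, changing a label means building a new Index. One way to do that is `rename`, sketched here on a small made-up dataframe (`frame` is an illustrative name):

```python
import pandas as pd

frame = pd.DataFrame({'x': [1, 2, 3]}, index=list('abc'))

# Item assignment on an Index raises TypeError
try:
    frame.index[1] = 'z'
    failed = False
except TypeError:
    failed = True

# rename() returns a new object carrying a new Index; the original is untouched
renamed = frame.rename(index={'b': 'z'})

print(failed)
print(renamed.index.tolist())
print(frame.index.tolist())
```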

We can also manipulate rows by referencing the index location in a similar manner to what we saw before:

In [100]:
df2.iloc[2:]

Unnamed: 0,province,population,year,debt,capital,2nd_language
c,M,3000000.0,2000,,True,
d,B,500000.0,1900,0.6,False,
e,B,1500000.0,2000,0.9,False,


In [None]:
#@title Saving df2 to Drive
import os
save_to_drive = True #@param {type:"boolean"}

if save_to_drive and files_loc:
  df2.to_pickle(os.path.join(files_loc,"df2.pkl"))
else:
  print('Please, mount Google Drive running the cell at the beginning and try again')

In [None]:
!ls {files_loc}

In [None]:
files_loc

### More on `loc` and `iloc` [optional practice]
See section 1 of the scrapbook.

### Dropping entries from an axis

Dropping one or more entries from an axis is easy if you have an index array or list without those entries. As that can require a bit of munging and set logic, the drop method will return a new object with the indicated value or values deleted from an axis.

Let's create a new series to demonstrate all this:


In [101]:
s5 = pd.Series(np.arange(5), list('jduvk'))
s5

j    0
d    1
u    2
v    3
k    4
dtype: int64

In [102]:
s6 = s5.drop(['d','k'])
s6

j    0
u    2
v    3
dtype: int64

Yes, we dropped elements present in the index defined by a list, but can we do just the opposite and keep the elements provided in a list while dropping everything else?:

In [103]:
s6b = s5[s5.index.intersection(['d','k'])]
s6b

d    1
k    4
dtype: int64

By default, `drop()` doesn't modify the original Series; it returns a new object. We can change that with the argument `inplace`, which we'll see later on:

In [104]:
s5

j    0
d    1
u    2
v    3
k    4
dtype: int64

In [105]:
s6['u'] = 7
s6

j    0
u    7
v    3
dtype: int64

In [106]:
s5

j    0
d    1
u    2
v    3
k    4
dtype: int64

Let's now work with Dataframes. First, let's see dropping elements from axes in a dataframe:

In [None]:
#@title Loading df2 from Drive
load_from_drive = True #@param {type:"boolean"}

if load_from_drive and files_loc:
  df2 = pd.read_pickle(os.path.join(files_loc,"df2.pkl"))
else:
  print('Please, mount Google Drive running the cell at the beginning and try again')

In [107]:
df2

Unnamed: 0,province,population,year,debt,capital,2nd_language
a,M,1500000.0,1900,,True,
b,M,2000000.0,1950,0.1,True,
c,M,3000000.0,2000,,True,
d,B,500000.0,1900,0.6,False,
e,B,1500000.0,2000,0.9,False,


Using `drop` on the dataframe will operate on the default axis 0, meaning that the specified label needs to be a **row**:

In [108]:
df2.drop('c')

Unnamed: 0,province,population,year,debt,capital,2nd_language
a,M,1500000.0,1900,,True,
b,M,2000000.0,1950,0.1,True,
d,B,500000.0,1900,0.6,False,
e,B,1500000.0,2000,0.9,False,


Let's now select the columns axis (axis 1) to remove a specific **column**, in this case `2nd_language`:

In [109]:
df2.drop('2nd_language', axis=1)

Unnamed: 0,province,population,year,debt,capital
a,M,1500000.0,1900,,True
b,M,2000000.0,1950,0.1,True
c,M,3000000.0,2000,,True
d,B,500000.0,1900,0.6,False
e,B,1500000.0,2000,0.9,False


We can see that we didn't modify the dataframe. In fact, we can make a copy of it:

In [110]:
df3 = df2.copy()
df3

Unnamed: 0,province,population,year,debt,capital,2nd_language
a,M,1500000.0,1900,,True,
b,M,2000000.0,1950,0.1,True,
c,M,3000000.0,2000,,True,
d,B,500000.0,1900,0.6,False,
e,B,1500000.0,2000,0.9,False,


Yes! The `copy()` method makes a deep copy by default, and what we're avoiding is accidentally modifying the original dataframe. Let's have a look with another example (slight detour).


In [None]:
df_detour = pd.DataFrame({'x': [1,2]})
df_sub = df_detour[0:1]
df_sub.x = -1
df_detour

Aha! We even get a warning... but let's continue and see how this, in contrast, leaves df_detour unchanged:

In [None]:
df_detour = pd.DataFrame({'x': [1,2]})
df_sub_copy = df_detour[0:1].copy()
df_sub_copy.x = -1
df_detour

OK, going back to what we were doing!

As mentioned before, let's use the parameter `inplace` to modify the dataframe right away:

In [None]:
df3.drop('capital', axis=1, inplace=True)
df3
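As an alternative to `axis=...`, `drop` also takes the keyword arguments `index=` and `columns=`, which some find easier to read. A minimal sketch on a made-up dataframe (`frame` is a hypothetical name); without `inplace=True` the original stays untouched:

```python
import pandas as pd

frame = pd.DataFrame(
    {'a': [1, 2], 'b': [3, 4], 'c': [5, 6]},
    index=['r1', 'r2'],
)

# Equivalent to frame.drop('b', axis=1)
no_b = frame.drop(columns=['b'])

# Equivalent to frame.drop('r1'), which uses the default axis 0
no_r1 = frame.drop(index=['r1'])

print(no_b)
print(no_r1)
print(frame.shape)  # the original still has every row and column
```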

### Indexing, selection, and filtering

The key here is that we can build boolean Series that we can use to index the original Series or DataFrame. Those booleans can be combined with bitwise boolean operators (&, |, ~) to get filters that are as complex as we need. 

Let's revisit our `s3` series that we defined above:

In [None]:
#@title Loading s3 from Drive
load_from_drive = True #@param {type:"boolean"}

if load_from_drive and files_loc:
  s3 = pd.read_pickle(os.path.join(files_loc,"s3.pkl"))
else:
  print('Please, mount Google Drive running the cell at the beginning and try again')

In [111]:
s3

Cordoba      300000.0
Madrid      6000000.0
Sevilla      500000.0
Zaragoza     250000.0
dtype: float64

We can select elements from the Series just passing a list of them:

In [112]:
s3[['Zaragoza', 'Madrid']]

Zaragoza     250000.0
Madrid      6000000.0
dtype: float64

Of course, we can use slice notation (remember we start at 0):

In [113]:
s3[2:]

Sevilla     500000.0
Zaragoza    250000.0
dtype: float64

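Slicing with labels behaves differently than normal Python slicing in that the endpoint is inclusive. The slice also follows the order of the index, so on our sorted index asking for `'Sevilla':'Cordoba'` returns an empty Series, because 'Sevilla' comes after 'Cordoba':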

In [114]:
s3['Sevilla':'Cordoba']

Series([], dtype: float64)
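To see the inclusive endpoint in action, the labels have to be given in index order. The sketch below rebuilds a Series with the same sorted labels as `s3` under the hypothetical name `pop`, and compares label slicing with positional slicing:

```python
import pandas as pd

# Same labels and values as the sorted s3, under an illustrative name
pop = pd.Series(
    [3e5, 6e6, 5e5, 2.5e5],
    index=['Cordoba', 'Madrid', 'Sevilla', 'Zaragoza'],
)

# Label slice: both endpoints are included
by_label = pop.loc['Cordoba':'Sevilla']

# Positional slice: the end position is excluded, as in plain Python
by_position = pop.iloc[0:2]

# Labels given against the index order: nothing is selected
reversed_labels = pop.loc['Sevilla':'Cordoba']

print(by_label)
print(by_position)
print(len(reversed_labels))
```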

But be careful, because by using the index position we default back to good ol' Python behavior (not inclusive):

In [115]:
s3[1:2]

Madrid    6000000.0
dtype: float64

We can apply a filter to the whole Series, generating in fact a boolean mask:

In [116]:
s3 > 1e06

Cordoba     False
Madrid       True
Sevilla     False
Zaragoza    False
dtype: bool

...and we can pass that boolean mask to the Series itself:





In [117]:
s3[s3>1e06]

Madrid    6000000.0
dtype: float64

Let's work now with our dataframe `df3`:

In [118]:
df3

Unnamed: 0,province,population,year,debt,capital,2nd_language
a,M,1500000.0,1900,,True,
b,M,2000000.0,1950,0.1,True,
c,M,3000000.0,2000,,True,
d,B,500000.0,1900,0.6,False,
e,B,1500000.0,2000,0.9,False,


First, let's build a boolean mask (filter) over the column `year` of our dataframe, which will give us a Series:

In [119]:
df3['year'] > 1950

a    False
b    False
c     True
d    False
e     True
Name: year, dtype: bool

And then, let's apply that filter over the full dataframe. The boolean Series is aligned with the dataframe's index, and only the rows (axis 0) where the mask is `True` are kept:

In [120]:
df3[df3['year'] > 1950]

Unnamed: 0,province,population,year,debt,capital,2nd_language
c,M,3000000.0,2000,,True,
e,B,1500000.0,2000,0.9,False,


We can combine all this, making more powerful filters:

In [121]:
df3[(df3['year'] > 1900) & (df3['debt'] > 0.5)]

Unnamed: 0,province,population,year,debt,capital,2nd_language
e,B,1500000.0,2000,0.9,False,


We can as well write this in a more elegant and Pythonic way:

In [122]:
recent = df3['year'] > 1900
indebted = df3['debt'] > 0.5

df3[recent & indebted]

Unnamed: 0,province,population,year,debt,capital,2nd_language
e,B,1500000.0,2000,0.9,False,
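The other two operators mentioned above, `|` and `~`, work the same way. Here is a minimal sketch on a stand-in dataframe that copies only the `year` and `debt` columns shown above (`frame` is an illustrative name). Note that comparisons against `NaN` evaluate to `False`, so rows with missing debt never count as indebted:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {'year': [1900, 1950, 2000, 1900, 2000],
     'debt': [np.nan, 0.1, np.nan, 0.6, 0.9]},
    index=list('abcde'),
)

recent = frame['year'] > 1900
indebted = frame['debt'] > 0.5  # NaN > 0.5 is False

either = frame[recent | indebted]   # rows matching at least one condition
not_recent = frame[~recent]         # rows where the condition is False

# When inlining, each comparison needs its own parentheses,
# because | and & bind more tightly than > in Python
inline = frame[(frame['year'] > 1900) | (frame['debt'] > 0.5)]

print(either.index.tolist())
print(not_recent.index.tolist())
```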


### Function application and mapping

Function application and mapping allow us to modify the elements of a DataFrame (columns with `apply` or individual elements with `applymap`) without for loops. This way we are not constrained to the functions already implemented by pandas or numpy.

In [123]:
df3

Unnamed: 0,province,population,year,debt,capital,2nd_language
a,M,1500000.0,1900,,True,
b,M,2000000.0,1950,0.1,True,
c,M,3000000.0,2000,,True,
d,B,500000.0,1900,0.6,False,
e,B,1500000.0,2000,0.9,False,


In [124]:
np.sqrt(df3['population'])

a    1224.744871
b    1414.213562
c    1732.050808
d     707.106781
e    1224.744871
Name: population, dtype: float64

In [125]:
df4 = pd.DataFrame(np.random.randn(4,3) * 17 + 15, columns=list('bde'), index=list('BMPZ'))
df4

Unnamed: 0,b,d,e
B,36.732457,12.337233,19.944179
M,4.672737,7.676932,47.144913
P,21.633124,11.650907,0.380926
Z,14.576186,-1.536476,-4.324391


In [126]:
np.abs(df4)

Unnamed: 0,b,d,e
B,36.732457,12.337233,19.944179
M,4.672737,7.676932,47.144913
P,21.633124,11.650907,0.380926
Z,14.576186,1.536476,4.324391


In [127]:
df4.b.B

36.732457123532875

This is a typical use case for lambdas (anonymous functions)

In [135]:
df4.iloc[0,0]

36.732457123532875

In [136]:
df4.apply(lambda series: series.max() - series.min())

b    32.059720
d    13.873710
e    51.469304
dtype: float64

In [137]:
df4.applymap(lambda element: element % 10 )

Unnamed: 0,b,d,e
B,6.732457,2.337233,9.944179
M,4.672737,7.676932,7.144913
P,1.633124,1.650907,0.380926
Z,4.576186,8.463524,5.675609


In [138]:
df4.apply(lambda series: series.max() - series.min(), axis=1)

B    24.395224
M    42.472176
P    21.252199
Z    18.900577
dtype: float64
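Because `df4` is built from random numbers, its results change on every run. The sketch below repeats the same two `apply` calls on a small fixed dataframe (`frame` and `spread` are hypothetical names) so the effect of `axis` can be checked by hand:

```python
import pandas as pd

frame = pd.DataFrame({'b': [1, 5, 3], 'd': [10, 2, 8]}, index=list('BMP'))

def spread(series):
    """Difference between the largest and smallest value."""
    return series.max() - series.min()

# Default axis=0: the function receives one column at a time
per_column = frame.apply(spread)

# axis=1: the function receives one row at a time
per_row = frame.apply(spread, axis=1)

print(per_column)
print(per_row)
```

With `axis=0` the result is indexed by the column labels, and with `axis=1` by the row labels.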

In [139]:
def f(series):
    return pd.Series([series.max(), series.min()], index=['max', 'min'])

df4.apply(f)

Unnamed: 0,b,d,e
max,36.732457,12.337233,47.144913
min,4.672737,-1.536476,-4.324391


In [None]:
def f2(series):
  return pd.Series([series.max() - series.min()], index=['distance'])

df4.apply(f2)

In [None]:
for item in df4.items():
    print(item)

In [None]:
# iteritems() is the older name for items(); it was removed in pandas 2.0
for item in df4.iteritems():
    print(item)

In [None]:
map(f, [1,2])

In [None]:
def format_2digits(number):
    return '%.2f' % number

In [None]:
df4.applymap(format_2digits)

### Sorting and ranking

In [140]:
df4

Unnamed: 0,b,d,e
B,36.732457,12.337233,19.944179
M,4.672737,7.676932,47.144913
P,21.633124,11.650907,0.380926
Z,14.576186,-1.536476,-4.324391


In [143]:
df4.sort_index(ascending=True)

Unnamed: 0,b,d,e
B,36.732457,12.337233,19.944179
M,4.672737,7.676932,47.144913
P,21.633124,11.650907,0.380926
Z,14.576186,-1.536476,-4.324391


In [144]:
df4.sort_index(ascending=False)

Unnamed: 0,b,d,e
Z,14.576186,-1.536476,-4.324391
P,21.633124,11.650907,0.380926
M,4.672737,7.676932,47.144913
B,36.732457,12.337233,19.944179


In [145]:
df4.sort_index(ascending=False, axis=1)

Unnamed: 0,e,d,b
B,19.944179,12.337233,36.732457
M,47.144913,7.676932,4.672737
P,0.380926,11.650907,21.633124
Z,-4.324391,-1.536476,14.576186


In [141]:
df4.sort_values(by='e')

Unnamed: 0,b,d,e
Z,14.576186,-1.536476,-4.324391
P,21.633124,11.650907,0.380926
B,36.732457,12.337233,19.944179
M,4.672737,7.676932,47.144913


We can sort by two columns; the second one is only used to break ties in the first:

In [142]:
df4.sort_values(by=['e','b'])

Unnamed: 0,b,d,e
Z,14.576186,-1.536476,-4.324391
P,21.633124,11.650907,0.380926
B,36.732457,12.337233,19.944179
M,4.672737,7.676932,47.144913
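With random floats there are no ties, so the second sort key has no visible effect in the example above. A sketch with made-up data that does contain ties (`frame`, `team` and `score` are hypothetical names), also showing that `ascending` accepts one flag per column:

```python
import pandas as pd

frame = pd.DataFrame(
    {'team': ['x', 'y', 'x', 'y'], 'score': [3, 7, 9, 1]},
    index=list('abcd'),
)

# Sort by team ascending, then by score descending within each team
ordered = frame.sort_values(by=['team', 'score'], ascending=[True, False])

print(ordered)
```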


In [146]:
s1 = pd.Series([2,3,8,4,3,2,1], index=list('abcdefg'))
s1

a    2
b    3
c    8
d    4
e    3
f    2
g    1
dtype: int64

In [147]:
s1.sort_values()

g    1
a    2
f    2
b    3
e    3
d    4
c    8
dtype: int64

`rank()` returns the position each element of the Series would have in its sorted version. If there are ties, it assigns the average position by default.

In [148]:
s1.rank()

a    2.5
b    4.5
c    7.0
d    6.0
e    4.5
f    2.5
g    1.0
dtype: float64

In [149]:
pd.Series([1,1,1]).rank()

0    2.0
1    2.0
2    2.0
dtype: float64
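Averaging is only the default way of breaking ties. The sketch below ranks the same values as `s1` with the other `method` options of `rank`, side by side (`scores` and `ranks` are illustrative names):

```python
import pandas as pd

scores = pd.Series([2, 3, 8, 4, 3, 2, 1], index=list('abcdefg'))

ranks = pd.DataFrame({
    'average': scores.rank(),             # ties share the mean of their positions
    'min': scores.rank(method='min'),     # ties share the lowest position
    'dense': scores.rank(method='dense'), # like 'min', but with no gaps afterwards
    'first': scores.rank(method='first'), # ties broken by order of appearance
})

print(ranks)
```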

In [150]:
s2 = pd.Series([30,10,20], index=list('abc'))
s2

a    30
b    10
c    20
dtype: int64

In [151]:
s2.rank()

a    3.0
b    1.0
c    2.0
dtype: float64

In [152]:
help(s2.rank)

Help on method rank in module pandas.core.generic:

rank(axis=0, method:str='average', numeric_only:Union[bool, NoneType]=None, na_option:str='keep', ascending:bool=True, pct:bool=False) -> ~FrameOrSeries method of pandas.core.series.Series instance
    Compute numerical data ranks (1 through n) along axis.
    
    By default, equal values are assigned a rank that is the average of the
    ranks of those values.
    
    Parameters
    ----------
    axis : {0 or 'index', 1 or 'columns'}, default 0
        Index to direct ranking.
    method : {'average', 'min', 'max', 'first', 'dense'}, default 'average'
        How to rank the group of records that have the same value (i.e. ties):
    
        * average: average rank of the group
        * min: lowest rank in the group
        * max: highest rank in the group
        * first: ranks assigned in order they appear in the array
        * dense: like 'min', but rank always increases by 1 between groups.
    
    numeric_only : bool, opti

#### Exercise

Write a function that takes a Series and returns its top 10% of records (in this case, the top earners). Test it with this Series:

```python
salaries = pd.Series([150000, 90000, 120000,30000,10000,5000,40000, 50000, 80000, 35000, 27000,14000, 28000, 22000,25000])
```

In [None]:
salaries = pd.Series([150000, 90000, 120000,30000,10000,5000,40000, 50000, 80000, 35000, 27000,14000, 28000, 22000,25000])

In [None]:
len(salaries)

In [None]:
def top_earners(serie):
    # Work on the argument, not on the global `salaries`, so any Series can be passed
    number_to_extract = round(len(serie) / 10)
    return serie.sort_values()[-number_to_extract:]

top_earners(salaries)

In [None]:
def top_earners(serie, percentile=0.9):
    is_top_earner = serie.rank(pct=True) > percentile
    return serie[is_top_earner]

print(top_earners(salaries))
print(top_earners(salaries, .8))
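A third possible approach, offered as a sketch rather than as the intended solution, computes the cut-off with `Series.quantile` and keeps everything above it. The function name `top_earners_by_quantile` and its parameter `q` are made up for this example:

```python
import pandas as pd

salaries = pd.Series([150000, 90000, 120000, 30000, 10000, 5000, 40000, 50000,
                      80000, 35000, 27000, 14000, 28000, 22000, 25000])

def top_earners_by_quantile(serie, q=0.9):
    """Return the entries strictly above the q-th quantile of the Series."""
    threshold = serie.quantile(q)
    return serie[serie > threshold]

threshold = salaries.quantile(0.9)
top = top_earners_by_quantile(salaries)

print(threshold)
print(top)
```

The three versions can disagree on small or tied data, because rounding a count, ranking and interpolating a quantile are not the same operation.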

## Summarizing and computing descriptive statistics

In [None]:
x = pd.Series([1.2, np.nan, 4, np.nan, 9], index=list('abcde'))
y = pd.Series([5, 3, 7, np.nan, 14], index=list('abcde'))

df = pd.DataFrame([x, y], index=['x','y'])
df

In [None]:
df = pd.DataFrame([x, y], index=['x','y']).T
df

In [None]:
df.sum()

As with many methods, we can use them in the direction perpendicular to their default.

In [None]:
df.sum(axis=1)

In [None]:
pd.__version__

In [None]:
df.sum(axis=1, skipna=False)
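
To make the effect of missing values on `sum` easier to compare, here is a small sketch that rebuilds the same `x` and `y` Series and sums each row in three ways. The `min_count` argument is not used elsewhere in this notebook; it is shown as one possible middle ground.

```python
import numpy as np
import pandas as pd

x = pd.Series([1.2, np.nan, 4, np.nan, 9], index=list('abcde'))
y = pd.Series([5, 3, 7, np.nan, 14], index=list('abcde'))
df = pd.DataFrame({'x': x, 'y': y})

# Default: NaN values are skipped, so the all-NaN row 'd' sums to 0.0
row_sum_default = df.sum(axis=1)
# skipna=False: a single NaN in the row makes the whole sum NaN
row_sum_strict = df.sum(axis=1, skipna=False)
# min_count=1: NaN only when the row has no valid value at all
row_sum_min1 = df.sum(axis=1, min_count=1)

print(pd.DataFrame({'default': row_sum_default,
                    'strict': row_sum_strict,
                    'min_count=1': row_sum_min1}))
```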

In [None]:
df.mean()

In [None]:
df.mean(axis=1)

In [None]:
df.cumsum()


In [None]:
df.std()

In [None]:
df.describe()

In [None]:
df['x'].sum()

In [None]:
df['x'].describe()

### Unique values, value counts, and membership

In [153]:
s7 = pd.Series(list('gtcaaagcttcga'))
s7

0     g
1     t
2     c
3     a
4     a
5     a
6     g
7     c
8     t
9     t
10    c
11    g
12    a
dtype: object

In [154]:
s7.unique()

array(['g', 't', 'c', 'a'], dtype=object)

In [156]:
s7.value_counts()

a    4
c    3
g    3
t    3
dtype: int64

In [157]:
puric_bases = ['a','g']
s7.isin(puric_bases)

0      True
1     False
2     False
3      True
4      True
5      True
6      True
7     False
8     False
9     False
10    False
11     True
12     True
dtype: bool

In [158]:
s7[s7.isin(puric_bases)]

0     g
3     a
4     a
5     a
6     g
11    g
12    a
dtype: object
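
These methods combine nicely. As a small sketch on the same sequence, `value_counts(normalize=True)` gives relative frequencies, and the mean of a boolean mask gives the share of elements that match:

```python
import pandas as pd

s7 = pd.Series(list('gtcaaagcttcga'))

# Relative frequency of each base instead of raw counts
freq = s7.value_counts(normalize=True)

# True counts as 1 and False as 0, so the mean is the fraction of purines
purine_share = s7.isin(['a', 'g']).mean()

print(freq)
print(purine_share)
```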

## Handling missing data

In [None]:
string_data = pd.Series(['Ma', 'Lu', 'Ca', 'Va', np.nan])
string_data

In [None]:
string_data[string_data!=np.nan]

This is weird, but there are good reasons for it: under the IEEE 754 rules, NaN compares unequal to every value, including itself, so the filter above keeps every row. You can find explanations [here](https://stackoverflow.com/questions/10034149/why-is-nan-not-equal-to-nan) and [here](https://stackoverflow.com/questions/1565164/what-is-the-rationale-for-all-comparisons-returning-false-for-ieee754-nan-values)

In [None]:
np.nan == np.nan
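
A short sketch of the consequence for filtering: a comparison against `np.nan` cannot find missing values, while `isna()` / `notna()` can. The Series is the same `string_data` as above.

```python
import numpy as np
import pandas as pd

string_data = pd.Series(['Ma', 'Lu', 'Ca', 'Va', np.nan])

# "!=" is True for every element, even the missing one, so nothing is filtered out
mask_ne = string_data != np.nan

# notna() is False exactly where the value is missing
mask_notna = string_data.notna()

print(mask_ne.tolist())
print(mask_notna.tolist())
```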

In [None]:
string_data[~string_data.isnull()]

### Filtering out missing data

In [None]:
string_data[string_data.notnull()]

In [250]:
df5 = pd.DataFrame([[1,2,3], 
                    [np.nan, 8, 7], 
                    [4, np.nan, 90], 
                    [67,42,53]], 
                   columns=list('abc'))
df5

Unnamed: 0,a,b,c
0,1.0,2.0,3
1,,8.0,7
2,4.0,,90
3,67.0,42.0,53


In [251]:
df5[df5['a'].notnull()]

Unnamed: 0,a,b,c
0,1.0,2.0,3
2,4.0,,90
3,67.0,42.0,53


In [252]:
df5.notnull()

Unnamed: 0,a,b,c
0,True,True,True
1,False,True,True
2,True,False,True
3,True,True,True


`any()` and `all()` reduce boolean data to a single boolean value by repeatedly applying the operators "or" and "and", respectively. On a DataFrame they do this column by column by default, returning one boolean per column.

In [253]:
df5.notnull().any()

a    True
b    True
c    True
dtype: bool

In [254]:
df5.notnull().all()

a    False
b    False
c     True
dtype: bool

In [255]:
df5.isnull().any()

a     True
b     True
c    False
dtype: bool
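
The same reductions also work row by row. As a sketch, passing `axis=1` produces one boolean per row, which we can use as a mask to pick the rows that have (or lack) missing values:

```python
import numpy as np
import pandas as pd

df5 = pd.DataFrame([[1, 2, 3],
                    [np.nan, 8, 7],
                    [4, np.nan, 90],
                    [67, 42, 53]],
                   columns=list('abc'))

# Rows where at least one value is missing
rows_with_missing = df5[df5.isnull().any(axis=1)]

# Rows where every value is present
complete_rows = df5[df5.notnull().all(axis=1)]

print(rows_with_missing)
print(complete_rows)
```

The second selection keeps the same rows as `df5.dropna()`.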

In [256]:
df5.dropna()

Unnamed: 0,a,b,c
0,1.0,2.0,3
3,67.0,42.0,53


In [257]:
df5

Unnamed: 0,a,b,c
0,1.0,2.0,3
1,,8.0,7
2,4.0,,90
3,67.0,42.0,53


In [258]:
df5.dropna(axis=1)

Unnamed: 0,c
0,3
1,7
2,90
3,53


In [None]:
array = np.random.randn(8,3) * 20 + 100

df6 = pd.DataFrame(array, columns=list('xyz'), index=list('abcdefgh'))
df6.iloc[2:5, 1] = np.nan
df6.iloc[1:3, 2] = np.nan
df6

The `thresh` argument specifies the minimum number of non-null values required to keep a row (or a column, with `axis=1`).

In [None]:
df6.dropna(thresh=2)

In [None]:
df6.dropna(thresh=2, axis=1)

In [None]:
df6.dropna(thresh=6, axis=1)
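
Since `df6` is random, here is a small deterministic sketch of `thresh` where the result is easy to check by eye. The DataFrame `small` is made up for this illustration.

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0],
                      'y': [np.nan, np.nan, 3.0, 4.0],
                      'z': [np.nan, np.nan, np.nan, 4.0]},
                     index=list('abcd'))

# Keep rows with at least 2 non-null values: 'a' and 'b' have only one, so they go
rows_kept = small.dropna(thresh=2)

# Keep columns with at least 2 non-null values: 'z' has only one, so it goes
cols_kept = small.dropna(thresh=2, axis=1)

print(rows_kept)
print(cols_kept)
```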

### Filling in missing data

In [None]:
df6.fillna(0)

In [None]:
df6.fillna({'x' : 100, 'y' : 50, 'z' : 20})

In [None]:
df6

In [None]:
df6.ffill()  # forward fill; same as fillna(method='ffill'), which recent pandas versions deprecate

In [None]:
df6.fillna(df6.median())

In [None]:
df6.median()
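
A deterministic sketch of two of the fill strategies above, on a tiny made-up DataFrame: filling with a Series (here the medians) uses a different value per column, and a forward fill cannot fill a gap in the first row because there is nothing before it.

```python
import numpy as np
import pandas as pd

gaps = pd.DataFrame({'x': [1.0, np.nan, 3.0],
                     'y': [np.nan, 20.0, 40.0]})

# Each column is filled with its own median (x -> 2.0, y -> 30.0)
filled_median = gaps.fillna(gaps.median())

# Forward fill copies the previous valid value downwards
filled_ffill = gaps.ffill()

print(filled_median)
print(filled_ffill)
```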

# Loading and saving data

## Loading CSV

Let's load data from the [US Bureau of Transportation Statistics](https://www.transtats.bts.gov/Tables.asp?DB_ID=111). For convenience, I've made this table available for you from Drive:

In [None]:
!wget https://bit.ly/ks-pds-csv3 -O {files_loc}/T100I_SEGMENT_ALL_CARRIER.csv

Make sure the file is there, and store the path in a Python variable using a Linux shell filter:

In [None]:
!ls {files_loc}

In [None]:
contents = !ls {files_loc}/T100I_SEGMENT_ALL_CARRIER.csv
csv_file = contents[0] #storing the first occurrence of the filter, should be our file
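
If you prefer to stay in Python instead of going through the shell, the standard library's `glob` module does the same filtering. This is only a sketch: it uses a temporary directory and an empty placeholder file so that it runs anywhere; in the notebook you would point it at `files_loc` instead.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as demo_dir:
    # Empty placeholder standing in for the downloaded file
    open(os.path.join(demo_dir, 'T100I_SEGMENT_ALL_CARRIER.csv'), 'w').close()

    # Sorting makes "the first match" predictable when several files match
    matches = sorted(glob.glob(os.path.join(demo_dir, '*SEGMENT*.csv')))
    csv_path = matches[0]
    found_name = os.path.basename(csv_path)

print(found_name)
```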

In [None]:
trafficDf = pd.read_csv(csv_file)

In [None]:
len(trafficDf)

In [None]:
trafficDf.head()

## Saving to Excel

Let's save the first 100 rows of the dataframe in an Excel file:

In [None]:
trafficDf.head(100).to_excel(os.path.join(files_loc, "excel_output_2.xls"))

Again, check that the file was generated. If you have Office, you can verify that it is indeed a proper Excel file:

In [None]:
!ls {files_loc}/*.xls

## Saving to CSV

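Now, let's truncate our existing dataframe and export it as CSV. Called without a path, `to_csv()` returns the CSV text as a string (here, the first 10 rows); given a path, it writes a file (here, the first 100 rows):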

In [None]:
trafficDf.head(10).to_csv()

In [None]:
trafficDf.head(100).to_csv(os.path.join(files_loc, "out.csv_2"))

In [None]:
!ls {files_loc}/*.csv*
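
A small self-contained sketch of both uses of `to_csv`, with a made-up DataFrame and a temporary directory so it does not touch Drive. Passing `index=False` is a choice made here so that the row index is not written as an extra column.

```python
import os
import tempfile

import pandas as pd

demo = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

# Without a path, to_csv returns the CSV content as a string
csv_text = demo.to_csv(index=False)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'out.csv')
    # With a path, to_csv writes the file and returns None
    demo.to_csv(path, index=False)
    round_trip = pd.read_csv(path)

print(csv_text)
print(round_trip)
```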

## To SQL database

Saving to a database with Pandas is trivial as well. For testing purposes, we'll be using a file-based sqlite3 database that we're creating from code:

In [None]:
import sqlite3
conn = sqlite3.connect(os.path.join(files_loc,'example.db_2'))

In [None]:
trafficDf.to_sql('traffic',conn, if_exists='replace')

In [None]:
!ls {files_loc}
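
To check that the rows really landed in the database, we can query them back with `pd.read_sql`. The sketch below uses an in-memory SQLite database and a made-up table, so the column names are assumptions and not those of the traffic data.

```python
import sqlite3

import pandas as pd

demo = pd.DataFrame({'carrier': ['AA', 'IB'], 'passengers': [200, 180]})

# ':memory:' creates a throwaway database that lives only in RAM
demo_conn = sqlite3.connect(':memory:')
demo.to_sql('traffic', demo_conn, if_exists='replace', index=False)

# Any SQL query can be loaded straight into a DataFrame
result = pd.read_sql('SELECT carrier, passengers FROM traffic WHERE passengers > 190',
                     demo_conn)
demo_conn.close()

print(result)
```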

## To dictionary and to json

See documentation of [pandas.DataFrame.to_dict](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_dict.html) to understand different options for converting DataFrames to dictionaries:

In [None]:
trafficDf.head(2).to_dict(orient='records')

Converting to JSON is quite similar:

In [None]:
trafficDf.head(2).to_json(orient='records')
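
To see how the two outputs relate, here is a sketch with a tiny made-up DataFrame: `to_dict(orient='records')` returns Python objects (a list with one dict per row), while `to_json(orient='records')` returns a string that parses back to the same structure.

```python
import json

import pandas as pd

demo = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

records = demo.to_dict(orient='records')   # list of dicts, one per row
as_json = demo.to_json(orient='records')   # the same structure, as a JSON string

print(records)
print(as_json)
```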

## Reading Excel

To practice reading from Excel, let's load what we saved previously:

In [None]:
df2 = pd.read_excel(os.path.join(files_loc, "excel_output_2.xls"))

In [None]:
df2.head()

If you explore the columns of the DataFrame above, you'll notice we brought back some unexpected visitors: columns whose names start with "Unnamed". In this case they come from the row index, which `to_excel` wrote out as the first column of the sheet. If you do not want them (for example, to preserve the original dataframe structure that we saved above), you can apply a filter like the following:

In [None]:
type(df2)

In [None]:
df2 = df2.loc[:, ~df2.columns.str.contains('^Unnamed')]
df2.head()

In [None]:
df2.shape

In [None]:
type(df2)

In [None]:
type(df2.columns.str.contains('^Unnamed'))

In [None]:
type(df2.columns)
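
The filter works because `columns.str.contains` returns one boolean per column name, and `~` negates that mask. A minimal sketch on a made-up DataFrame that mimics what came back from Excel:

```python
import pandas as pd

demo = pd.DataFrame({'Unnamed: 0': [0, 1], 'a': [1, 2], 'b': [3, 4]})

# True for every column whose name starts with "Unnamed"
mask = demo.columns.str.contains('^Unnamed')

# Keep all rows, and only the columns where the mask is False
cleaned = demo.loc[:, ~mask]

print(mask)
print(cleaned)
```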

But we can avoid the cleanup altogether by telling `read_excel` that the first column holds the index:

In [None]:
df2 = pd.read_excel(os.path.join(files_loc, "excel_output_2.xls"), index_col=0)
df2.head()

### The challenge

Let's download data on car accidents in Madrid straight from the source and explore it a bit to understand its structure. You'll be doing a data consolidation exercise afterwards:

In [4]:
!wget https://datos.madrid.es/egob/catalogo/207831-0-accidentes-trafico.xls -P {files_loc}

--2020-12-23 10:00:39--  https://datos.madrid.es/egob/catalogo/207831-0-accidentes-trafico.xls
Resolving datos.madrid.es (datos.madrid.es)... 23.203.48.143, 23.203.48.140, 2600:1402:3800::1706:7621, ...
Connecting to datos.madrid.es (datos.madrid.es)|23.203.48.143|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://datos.madrid.es:443/egobfiles/MANUAL/207831/01_%20ACCIDENTES%20POR%20TIPO%20EN%20%20DISTRITOS.xls [following]
--2020-12-23 10:00:40--  https://datos.madrid.es/egobfiles/MANUAL/207831/01_%20ACCIDENTES%20POR%20TIPO%20EN%20%20DISTRITOS.xls
Reusing existing connection to datos.madrid.es:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/vnd.ms-excel]
Saving to: ‘/content/gdrive/MyDrive/pdsfiles/207831-0-accidentes-trafico.xls.6’

207831-0-accidentes     [ <=>                ]  60.50K  --.-KB/s    in 0.01s   

2020-12-23 10:00:41 (4.51 MB/s) - ‘/content/gdrive/MyDrive/pdsfiles/207831-0-accidentes-tra

Again, we store the full file path into a Python variable for convenience:

In [5]:
contents = !ls {files_loc}/*accidentes*
file_path = contents[0]

Then, we read the file and explore the structure:

In [6]:
df_accidentes = pd.read_excel(file_path)

df_accidentes.head(10)

Unnamed: 0,01. ACCIDENTES POR TIPO EN DISTRITOS,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11
0,,,,,,,,,,,,
1,,,,,,,,,,,,
2,,,,,,,,,,,,
3,,,,,,,,,,,,
4,Indicadores,Nº Accidentes,,,,,,,,,,
5,Año,2009,,,,,,,,,,Total
6,DISTRITO_ACCIDENTE,COLISIÓN DOBLE,COLISIÓN MÚLTIPLE,CHOQUE CON OBJETO FIJO,ATROPELLO,VUELCO,CAÍDA MOTOCICLETA,CAÍDA CICLOMOTOR,CAÍDA BICICLETA,CAÍDA VIAJERO BUS,OTRAS CAUSAS,
7,ARGANZUELA,389,54,126,75,9,43,9,4,5,8,722
8,BARAJAS,89,6,53,21,,12,2,1,2,2,188
9,CARABANCHEL,375,44,171,137,8,36,13,6,9,8,807


There are plenty of options we can use when reading an Excel file. Here are some of them: we select a named sheet from the workbook, skip the rows above the header, drop the footer row, and keep only a range of columns:

In [7]:
df_accidentes = pd.read_excel(file_path, sheet_name='2016', index_col=0, skiprows=7, skipfooter=1, usecols='A:L')

df_accidentes.head()

Unnamed: 0_level_0,COLISIÓN DOBLE,COLISIÓN MÚLTIPLE,CHOQUE CON OBJETO FIJO,ATROPELLO,VUELCO,CAÍDA MOTOCICLETA,CAÍDA CICLOMOTOR,CAÍDA BICICLETA,CAÍDA VIAJERO BUS,OTRAS CAUSAS,CAÍDA VEHÍCULO 3 RUEDAS
DISTRITO_ACCIDENTE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
ARGANZUELA,328,62,86,57,1.0,46,12,10,6,2.0,
BARAJAS,95,6,50,34,4.0,12,2,4,2,4.0,
CARABANCHEL,346,55,121,106,5.0,53,18,7,8,6.0,
CENTRO,444,23,172,144,,97,11,33,5,3.0,
CHAMARTIN,499,59,122,80,5.0,105,14,8,5,7.0,


In [8]:
df_accidentes.tail()

Unnamed: 0_level_0,COLISIÓN DOBLE,COLISIÓN MÚLTIPLE,CHOQUE CON OBJETO FIJO,ATROPELLO,VUELCO,CAÍDA MOTOCICLETA,CAÍDA CICLOMOTOR,CAÍDA BICICLETA,CAÍDA VIAJERO BUS,OTRAS CAUSAS,CAÍDA VEHÍCULO 3 RUEDAS
DISTRITO_ACCIDENTE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
TETUAN,330,43,70,91,,73,25,13,5,3.0,
USERA,206,27,74,75,2.0,16,2,5,4,1.0,
VICALVARO,70,4,38,22,4.0,15,2,3,2,1.0,
VILLA DE VALLECAS,149,7,42,54,5.0,26,4,4,1,3.0,
VILLAVERDE,177,11,71,54,1.0,8,5,6,2,1.0,


Let's have a look at the data from 2010:

In [9]:
accidentes_2010 = pd.read_excel(file_path, 
                                index_col=0, 
                                header=7, 
                                sheet_name='2010', 
                                skipfooter=1,
                                usecols='A:K')
accidentes_2010

Unnamed: 0_level_0,COLISIÓN DOBLE,COLISIÓN MÚLTIPLE,CHOQUE CON OBJETO FIJO,ATROPELLO,VUELCO,CAÍDA MOTOCICLETA,CAÍDA CICLOMOTOR,CAÍDA BICICLETA,CAÍDA VIAJERO BUS,OTRAS CAUSAS
DISTRITO_ACCIDENTE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
ARGANZUELA,403,48,103,75,7,26,15.0,2.0,5.0,6.0
BARAJAS,62,5,34,22,1,11,,,,
CARABANCHEL,357,35,133,114,7,22,14.0,6.0,7.0,1.0
CENTRO,546,43,142,169,2,60,14.0,9.0,13.0,9.0
CHAMARTIN,460,67,128,85,11,77,16.0,1.0,8.0,5.0
CHAMBERI,361,34,86,103,1,66,14.0,5.0,2.0,4.0
CIUDAD LINEAL,404,79,120,90,8,39,12.0,6.0,7.0,2.0
FUENCARRAL-EL PARDO,294,49,165,72,10,26,9.0,6.0,3.0,2.0
HORTALEZA,183,15,68,44,4,11,8.0,4.0,3.0,3.0
LATINA,273,37,93,94,6,24,4.0,7.0,10.0,4.0


Using `sheet_name=None` we load the full workbook. We can then access individual sheets by the sheet name as follows:

In [10]:
#After inspecting the Excel file, the useful data in every sheet starts at row 8 (except in the 2013 sheet, where it starts at row 5),
#which is why we pass "header=7".
#The useful data also spans columns A to K (inclusive), except in the 2016 sheet, which has one more column and so reaches column L,
#which is why we pass "usecols='A:K'".
#Finally, the last row is not wanted, which is why we pass "skipfooter=1"
all_accidents = pd.read_excel(file_path, index_col=0, header=7, sheet_name=None, skipfooter=1, usecols='A:K')
all_accidents['2009'].head()

Unnamed: 0_level_0,COLISIÓN DOBLE,COLISIÓN MÚLTIPLE,CHOQUE CON OBJETO FIJO,ATROPELLO,VUELCO,CAÍDA MOTOCICLETA,CAÍDA CICLOMOTOR,CAÍDA BICICLETA,CAÍDA VIAJERO BUS,OTRAS CAUSAS
DISTRITO_ACCIDENTE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
ARGANZUELA,389,54,126,75,9.0,43,9,4.0,5,8
BARAJAS,89,6,53,21,,12,2,1.0,2,2
CARABANCHEL,375,44,171,137,8.0,36,13,6.0,9,8
CENTRO,514,55,171,143,,61,12,3.0,10,12
CHAMARTIN,494,70,133,92,10.0,72,22,2.0,6,7


Let's iterate over all the elements of the data:

In [None]:
for k, v in all_accidents.items():
    print(k,v,type(k),type(v))

What kind of structure are we dealing with here?

In [12]:
type(all_accidents)

dict

In [13]:
type(all_accidents['2009'])

pandas.core.frame.DataFrame

In [14]:
type(all_accidents['2009'].columns.str.contains('Unn'))

numpy.ndarray

We printed the keys and values in our loop above. What are the keys? What kind of data structure are the values? Play around a bit with this: you'll need to understand the structure of what you got to tackle the next exercise.

In [15]:
all_accidents.keys()

dict_keys(['2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016'])

In [16]:
type(all_accidents.keys())

dict_keys

In [17]:
#We see that "all_accidents" is a dictionary whose keys are the year names and whose values are DFs.
type(all_accidents.values())

dict_values

#### Exercise

Consolidate the excel into one DataFrame:
- You will need to create a 'YEAR' column.
- Think how you can iterate through all the DataFrames.
- Explore where you can take the value of 'YEAR' from. After the exploration above, it should be clear by now.

In [18]:
years = all_accidents.keys()

for year in years:
  print(year, all_accidents[year].shape)


2009 (21, 10)
2010 (21, 10)
2011 (21, 10)
2012 (21, 10)
2013 (18, 10)
2014 (21, 10)
2015 (21, 10)
2016 (21, 10)


In [20]:
#The 2013 sheet is an anomaly: its data starts at row 5 instead of row 8 like the rest.
#So we remove 2013 from the dict and read it again with the right header.
del all_accidents['2013']

In [21]:
#Now I read the 2013 DF again:
all_accidents['2013'] = pd.read_excel(file_path, sheet_name='2013', index_col=0, header=4, skipfooter=1, usecols='A:K')

In [22]:
#Now 2013 appears at the end of "all_accidents.keys()".
all_accidents.keys()

dict_keys(['2009', '2010', '2011', '2012', '2014', '2015', '2016', '2013'])

In [23]:
for year in all_accidents.keys():
  print(year, all_accidents[year].shape)

2009 (21, 10)
2010 (21, 10)
2011 (21, 10)
2012 (21, 10)
2014 (21, 10)
2015 (21, 10)
2016 (21, 10)
2013 (21, 10)


In [None]:
all_accidents

In [25]:
#Now we delete the 2016 sheet and read it again, this time including the new column
del all_accidents['2016']

In [26]:
all_accidents['2016'] = pd.read_excel(file_path, sheet_name='2016', index_col=0, header=7, skipfooter=1, usecols='A:L', )

In [27]:
all_accidents.keys()

dict_keys(['2009', '2010', '2011', '2012', '2014', '2015', '2013', '2016'])

In [28]:
for year in all_accidents.keys():
  print(year, all_accidents[year].shape)

2009 (21, 10)
2010 (21, 10)
2011 (21, 10)
2012 (21, 10)
2014 (21, 10)
2015 (21, 10)
2013 (21, 10)
2016 (21, 11)


In [None]:
all_accidents

In [30]:
#Now we concatenate all the DFs with pd.concat
all_accidents_tot = pd.concat(all_accidents.values(), keys=all_accidents.keys())

In [31]:
all_accidents_tot.head(30)

Unnamed: 0_level_0,Unnamed: 1_level_0,COLISIÓN DOBLE,COLISIÓN MÚLTIPLE,CHOQUE CON OBJETO FIJO,ATROPELLO,VUELCO,CAÍDA MOTOCICLETA,CAÍDA CICLOMOTOR,CAÍDA BICICLETA,CAÍDA VIAJERO BUS,OTRAS CAUSAS,CAÍDA VEHÍCULO 3 RUEDAS
Unnamed: 0_level_1,DISTRITO_ACCIDENTE,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
2009,ARGANZUELA,389,54,126,75,9.0,43,9.0,4.0,5.0,8.0,
2009,BARAJAS,89,6,53,21,,12,2.0,1.0,2.0,2.0,
2009,CARABANCHEL,375,44,171,137,8.0,36,13.0,6.0,9.0,8.0,
2009,CENTRO,514,55,171,143,,61,12.0,3.0,10.0,12.0,
2009,CHAMARTIN,494,70,133,92,10.0,72,22.0,2.0,6.0,7.0,
2009,CHAMBERI,414,35,87,87,3.0,51,12.0,4.0,4.0,5.0,
2009,CIUDAD LINEAL,398,64,124,101,3.0,57,17.0,5.0,4.0,4.0,
2009,FUENCARRAL-EL PARDO,321,47,151,78,11.0,34,16.0,7.0,2.0,8.0,
2009,HORTALEZA,215,16,109,58,3.0,29,13.0,7.0,4.0,4.0,
2009,LATINA,298,40,134,85,7.0,24,10.0,2.0,9.0,11.0,


In [32]:
#Note that when concatenating, the columns that are not common to all the DFs are filled with NaN.
#Also, with the option chosen above (keys=...), we get a two-level index: year and district.
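
Here is a small sketch of both effects on two made-up "sheets". It assumes the sheets are held in a dict keyed by year, as above; `pd.concat` accepts the dict directly and uses its keys as the outer index level. The level names `YEAR` and `DISTRICT` are chosen here for illustration.

```python
import pandas as pd

sheets = {
    '2009': pd.DataFrame({'A': [1, 2]}, index=['X', 'Y']),
    '2010': pd.DataFrame({'A': [3, 4], 'B': [5, 6]}, index=['X', 'Y']),
}

# The dict keys become the outer level of a two-level index
combined = pd.concat(sheets, names=['YEAR', 'DISTRICT'])

# reset_index turns both index levels into ordinary columns
flat = combined.reset_index()

print(combined)
print(flat)
```

Column `B` only exists in the 2010 sheet, so the 2009 rows get NaN there.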

In [33]:
type(all_accidents_tot)

pandas.core.frame.DataFrame

In [None]:
all_accidents_tot.index

In [52]:
#Let's play a bit with the two-level index...
all_accidents_tot.iloc[(0,0)]

389

In [53]:
all_accidents_tot.iloc[0,0]

389

In [54]:
all_accidents_tot.iloc[0]

COLISIÓN DOBLE                              389.0
COLISIÓN MÚLTIPLE                            54.0
CHOQUE CON OBJETO FIJO                      126.0
ATROPELLO                                    75.0
VUELCO                                        9.0
CAÍDA MOTOCICLETA                            43.0
CAÍDA CICLOMOTOR                              9.0
CAÍDA BICICLETA                               4.0
CAÍDA VIAJERO BUS                             5.0
OTRAS CAUSAS                                  8.0
CAÍDA VEHÍCULO 3 RUEDAS                       NaN
Name: (2009, ARGANZUELA                    ), dtype: float64

In [55]:
all_accidents_tot.loc['2009', 'ARGANZUELA                    ']

COLISIÓN DOBLE                              389.0
COLISIÓN MÚLTIPLE                            54.0
CHOQUE CON OBJETO FIJO                      126.0
ATROPELLO                                    75.0
VUELCO                                        9.0
CAÍDA MOTOCICLETA                            43.0
CAÍDA CICLOMOTOR                              9.0
CAÍDA BICICLETA                               4.0
CAÍDA VIAJERO BUS                             5.0
OTRAS CAUSAS                                  8.0
CAÍDA VEHÍCULO 3 RUEDAS                       NaN
Name: (2009, ARGANZUELA                    ), dtype: float64
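
A few more ways to select from a two-level index, sketched on a tiny made-up DataFrame (the level names `YEAR` and `DISTRICT` are assumptions for the example): `.loc` with only the outer label returns every row for that year, a tuple addresses a single row, and `xs` slices on the inner level.

```python
import pandas as pd

idx = pd.MultiIndex.from_product([['2009', '2010'], ['X', 'Y']],
                                 names=['YEAR', 'DISTRICT'])
demo = pd.DataFrame({'A': [1, 2, 3, 4]}, index=idx)

one_year = demo.loc['2009']                   # all districts for 2009
one_cell = demo.loc[('2010', 'Y'), 'A']       # a single value
one_district = demo.xs('X', level='DISTRICT')  # district X across all years

print(one_year)
print(one_cell)
print(one_district)
```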

In [56]:
all_accidents_tot.index.size

168

In [57]:
all_accidents_tot.columns.size

11

In [58]:
#There is a problem with the district names in the index: they carry many trailing spaces
#It would be great to fix that...

In [59]:
# -----------------------------------------------------------------------------------------------------------------
# SINCE WORKING WITH A TWO-LEVEL INDEX IS A BIT MESSY, I WILL ADD A "YEAR" COLUMN TO EACH DF AND THEN CONCATENATE THEM
# -----------------------------------------------------------------------------------------------------------------

In [41]:
#I copy the all_accidents dict, copying each DF as well so that the originals are left untouched
all_accidents_2 = {year: df.copy() for year, df in all_accidents.items()}

In [None]:
all_accidents_2

In [42]:
type(all_accidents_2['2009'])

pandas.core.frame.DataFrame

In [43]:
all_accidents_2.keys()

dict_keys(['2009', '2010', '2011', '2012', '2014', '2015', '2013', '2016'])

In [None]:
#Add the "AÑO" (year) column to each DF so that we can sort by it later (it replaces the MultiIndex)
for year in all_accidents_2.keys():
  all_accidents_2[year]['AÑO'] = year
all_accidents_2
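
One detail worth knowing here: `dict.copy()` is a shallow copy, so the copied dict still points at the very same DataFrame objects, and adding a column through one dict also changes the DataFrames seen through the other. The sketch below shows this on made-up data, together with `assign`, which returns a new DataFrame and leaves the original alone. Converting the year to `int` is a choice made for the example.

```python
import pandas as pd

sheets = {'2009': pd.DataFrame({'A': [1, 2]}),
          '2010': pd.DataFrame({'A': [3, 4]})}

# A shallow copy shares the DataFrame objects with the original dict
shallow = sheets.copy()
same_object = shallow['2009'] is sheets['2009']

# assign() builds a new DataFrame per year, so `sheets` stays unchanged
with_year = {year: df.assign(YEAR=int(year)) for year, df in sheets.items()}
combined = pd.concat(list(with_year.values()), ignore_index=True)

print(same_object)
print(combined)
```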

In [64]:
type(all_accidents_2['2009'])

pandas.core.frame.DataFrame

In [65]:
type(all_accidents_2)

dict

In [66]:
all_accidents_2['2009'].head()

Unnamed: 0_level_0,COLISIÓN DOBLE,COLISIÓN MÚLTIPLE,CHOQUE CON OBJETO FIJO,ATROPELLO,VUELCO,CAÍDA MOTOCICLETA,CAÍDA CICLOMOTOR,CAÍDA BICICLETA,CAÍDA VIAJERO BUS,OTRAS CAUSAS,AÑO
DISTRITO_ACCIDENTE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
ARGANZUELA,389,54,126,75,9.0,43,9,4.0,5,8,2009
BARAJAS,89,6,53,21,,12,2,1.0,2,2,2009
CARABANCHEL,375,44,171,137,8.0,36,13,6.0,9,8,2009
CENTRO,514,55,171,143,,61,12,3.0,10,12,2009
CHAMARTIN,494,70,133,92,10.0,72,22,2.0,6,7,2009


In [67]:
all_accidents_2['2016'].head()

Unnamed: 0_level_0,COLISIÓN DOBLE,COLISIÓN MÚLTIPLE,CHOQUE CON OBJETO FIJO,ATROPELLO,VUELCO,CAÍDA MOTOCICLETA,CAÍDA CICLOMOTOR,CAÍDA BICICLETA,CAÍDA VIAJERO BUS,OTRAS CAUSAS,CAÍDA VEHÍCULO 3 RUEDAS,AÑO
DISTRITO_ACCIDENTE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
ARGANZUELA,328,62,86,57,1.0,46,12,10,6,2.0,,2016
BARAJAS,95,6,50,34,4.0,12,2,4,2,4.0,,2016
CARABANCHEL,346,55,121,106,5.0,53,18,7,8,6.0,,2016
CENTRO,444,23,172,144,,97,11,33,5,3.0,,2016
CHAMARTIN,499,59,122,80,5.0,105,14,8,5,7.0,,2016


In [68]:
#Now they can be concatenated without ending up with a two-level index
all_accidents_tot_2 = pd.concat(all_accidents_2.values())

In [69]:
all_accidents_tot_2.head()

Unnamed: 0_level_0,COLISIÓN DOBLE,COLISIÓN MÚLTIPLE,CHOQUE CON OBJETO FIJO,ATROPELLO,VUELCO,CAÍDA MOTOCICLETA,CAÍDA CICLOMOTOR,CAÍDA BICICLETA,CAÍDA VIAJERO BUS,OTRAS CAUSAS,AÑO,CAÍDA VEHÍCULO 3 RUEDAS
DISTRITO_ACCIDENTE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
ARGANZUELA,389,54,126,75,9.0,43,9.0,4.0,5.0,8.0,2009,
BARAJAS,89,6,53,21,,12,2.0,1.0,2.0,2.0,2009,
CARABANCHEL,375,44,171,137,8.0,36,13.0,6.0,9.0,8.0,2009,
CENTRO,514,55,171,143,,61,12.0,3.0,10.0,12.0,2009,
CHAMARTIN,494,70,133,92,10.0,72,22.0,2.0,6.0,7.0,2009,


In [70]:
all_accidents_tot_2.tail()

Unnamed: 0_level_0,COLISIÓN DOBLE,COLISIÓN MÚLTIPLE,CHOQUE CON OBJETO FIJO,ATROPELLO,VUELCO,CAÍDA MOTOCICLETA,CAÍDA CICLOMOTOR,CAÍDA BICICLETA,CAÍDA VIAJERO BUS,OTRAS CAUSAS,AÑO,CAÍDA VEHÍCULO 3 RUEDAS
DISTRITO_ACCIDENTE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
TETUAN,330,43,70,91,,73,25.0,13.0,5.0,3.0,2016,
USERA,206,27,74,75,2.0,16,2.0,5.0,4.0,1.0,2016,
VICALVARO,70,4,38,22,4.0,15,2.0,3.0,2.0,1.0,2016,
VILLA DE VALLECAS,149,7,42,54,5.0,26,4.0,4.0,1.0,3.0,2016,
VILLAVERDE,177,11,71,54,1.0,8,5.0,6.0,2.0,1.0,2016,


In [71]:
#Now I'll try to sort the rows of the DF by year (sort_values sorts in ascending order by default), taking care not to disturb the alphabetical order of the index

all_accidents_tot_2.sort_values(by='AÑO')


Unnamed: 0_level_0,COLISIÓN DOBLE,COLISIÓN MÚLTIPLE,CHOQUE CON OBJETO FIJO,ATROPELLO,VUELCO,CAÍDA MOTOCICLETA,CAÍDA CICLOMOTOR,CAÍDA BICICLETA,CAÍDA VIAJERO BUS,OTRAS CAUSAS,AÑO,CAÍDA VEHÍCULO 3 RUEDAS
DISTRITO_ACCIDENTE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
ARGANZUELA,389,54,126,75,9.0,43,9.0,4.0,5.0,8.0,2009,
VILLAVERDE,148,19,65,46,4.0,9,4.0,3.0,1.0,8.0,2009,
VILLA DE VALLECAS,142,6,67,36,3.0,8,4.0,1.0,2.0,3.0,2009,
VICALVARO,86,6,39,15,6.0,9,4.0,3.0,2.0,2.0,2009,
USERA,214,24,96,53,1.0,13,2.0,1.0,4.0,3.0,2009,
...,...,...,...,...,...,...,...,...,...,...,...,...
CARABANCHEL,346,55,121,106,5.0,53,18.0,7.0,8.0,6.0,2016,
BARAJAS,95,6,50,34,4.0,12,2.0,4.0,2.0,4.0,2016,
VILLA DE VALLECAS,149,7,42,54,5.0,26,4.0,4.0,1.0,3.0,2016,
LATINA,288,40,119,104,3.0,36,12.0,11.0,7.0,2.0,2016,


In [72]:
#The index comes out in no particular order, so this sort is not what we want
#Let's try something else
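
One way to get both orders at once, sketched on made-up data with the district as a regular column: `sort_values` accepts a list of columns, and a matching list for `ascending`, so we can sort by year (descending) and break ties by district (ascending).

```python
import pandas as pd

demo = pd.DataFrame({'DISTRICT': ['B', 'A', 'B', 'A'],
                     'YEAR': [2009, 2010, 2010, 2009]})

# Primary key: YEAR, newest first. Secondary key: DISTRICT, alphabetical.
ordered = demo.sort_values(by=['YEAR', 'DISTRICT'], ascending=[False, True])

print(ordered)
```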

In [109]:
#I confirm that the original DF has not been modified
all_accidents_tot_2

Unnamed: 0_level_0,COLISIÓN DOBLE,COLISIÓN MÚLTIPLE,CHOQUE CON OBJETO FIJO,ATROPELLO,VUELCO,CAÍDA MOTOCICLETA,CAÍDA CICLOMOTOR,CAÍDA BICICLETA,CAÍDA VIAJERO BUS,OTRAS CAUSAS,AÑO,CAÍDA VEHÍCULO 3 RUEDAS
DISTRITO_ACCIDENTE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
ARGANZUELA,389,54,126,75,9.0,43,9.0,4.0,5.0,8.0,2009,
BARAJAS,89,6,53,21,,12,2.0,1.0,2.0,2.0,2009,
CARABANCHEL,375,44,171,137,8.0,36,13.0,6.0,9.0,8.0,2009,
CENTRO,514,55,171,143,,61,12.0,3.0,10.0,12.0,2009,
CHAMARTIN,494,70,133,92,10.0,72,22.0,2.0,6.0,7.0,2009,
...,...,...,...,...,...,...,...,...,...,...,...,...
TETUAN,330,43,70,91,,73,25.0,13.0,5.0,3.0,2016,
USERA,206,27,74,75,2.0,16,2.0,5.0,4.0,1.0,2016,
VICALVARO,70,4,38,22,4.0,15,2.0,3.0,2.0,1.0,2016,
VILLA DE VALLECAS,149,7,42,54,5.0,26,4.0,4.0,1.0,3.0,2016,


In [130]:
#First, an index with duplicated labels does not make much sense, so I turn that index into a regular column.
#I create a new variable so that no information is lost if something goes wrong
all_accidents_tot_3 = all_accidents_tot_2.reset_index()

In [131]:
all_accidents_tot_3.head()

Unnamed: 0,DISTRITO_ACCIDENTE,COLISIÓN DOBLE,COLISIÓN MÚLTIPLE,CHOQUE CON OBJETO FIJO,ATROPELLO,VUELCO,CAÍDA MOTOCICLETA,CAÍDA CICLOMOTOR,CAÍDA BICICLETA,CAÍDA VIAJERO BUS,OTRAS CAUSAS,AÑO,CAÍDA VEHÍCULO 3 RUEDAS
0,ARGANZUELA,389,54,126,75,9.0,43,9.0,4.0,5.0,8.0,2009,
1,BARAJAS,89,6,53,21,,12,2.0,1.0,2.0,2.0,2009,
2,CARABANCHEL,375,44,171,137,8.0,36,13.0,6.0,9.0,8.0,2009,
3,CENTRO,514,55,171,143,,61,12.0,3.0,10.0,12.0,2009,
4,CHAMARTIN,494,70,133,92,10.0,72,22.0,2.0,6.0,7.0,2009,


In [132]:
all_accidents_tot_3.columns

Index(['DISTRITO_ACCIDENTE', 'COLISIÓN DOBLE                          ',
       'COLISIÓN MÚLTIPLE                       ',
       'CHOQUE CON OBJETO FIJO                  ',
       'ATROPELLO                               ',
       'VUELCO                                  ',
       'CAÍDA MOTOCICLETA                       ',
       'CAÍDA CICLOMOTOR                        ',
       'CAÍDA BICICLETA                         ',
       'CAÍDA VIAJERO BUS                       ',
       'OTRAS CAUSAS                            ', 'AÑO',
       'CAÍDA VEHÍCULO 3 RUEDAS                 '],
      dtype='object')

In [133]:
#Now I put the columns in the order I want them to appear in the DF
cols = all_accidents_tot_3.columns.tolist()

In [134]:
cols

['DISTRITO_ACCIDENTE',
 'COLISIÓN DOBLE                          ',
 'COLISIÓN MÚLTIPLE                       ',
 'CHOQUE CON OBJETO FIJO                  ',
 'ATROPELLO                               ',
 'VUELCO                                  ',
 'CAÍDA MOTOCICLETA                       ',
 'CAÍDA CICLOMOTOR                        ',
 'CAÍDA BICICLETA                         ',
 'CAÍDA VIAJERO BUS                       ',
 'OTRAS CAUSAS                            ',
 'AÑO',
 'CAÍDA VEHÍCULO 3 RUEDAS                 ']

In [135]:
type(cols)

list

In [136]:
len(cols)

13

In [137]:
cols.index('AÑO')

11

In [138]:
cols[11]

'AÑO'

In [139]:
type(cols[11])

str

In [140]:
#Careful with this: a slice returns a list, not a string, which causes problems later
cols[11:12]

['AÑO']

In [141]:
type(cols[11:12])

list

In [142]:
cols[0:11]

['DISTRITO_ACCIDENTE',
 'COLISIÓN DOBLE                          ',
 'COLISIÓN MÚLTIPLE                       ',
 'CHOQUE CON OBJETO FIJO                  ',
 'ATROPELLO                               ',
 'VUELCO                                  ',
 'CAÍDA MOTOCICLETA                       ',
 'CAÍDA CICLOMOTOR                        ',
 'CAÍDA BICICLETA                         ',
 'CAÍDA VIAJERO BUS                       ',
 'OTRAS CAUSAS                            ']

In [143]:
cols_2 = [cols[0]] + [cols[11]] + cols[1:11] + cols[12:]
cols_2

['DISTRITO_ACCIDENTE',
 'AÑO',
 'COLISIÓN DOBLE                          ',
 'COLISIÓN MÚLTIPLE                       ',
 'CHOQUE CON OBJETO FIJO                  ',
 'ATROPELLO                               ',
 'VUELCO                                  ',
 'CAÍDA MOTOCICLETA                       ',
 'CAÍDA CICLOMOTOR                        ',
 'CAÍDA BICICLETA                         ',
 'CAÍDA VIAJERO BUS                       ',
 'OTRAS CAUSAS                            ',
 'CAÍDA VEHÍCULO 3 RUEDAS                 ']
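
The same reordering can be written without hard-coding positions such as `11`. This is just a sketch with made-up column names: `list.pop` removes the column name from wherever it is, and `list.insert` puts it back at the position we want.

```python
import pandas as pd

demo = pd.DataFrame({'DISTRICT': ['X'], 'A': [1], 'YEAR': [2009], 'B': [2]})

cols = demo.columns.tolist()
# Take 'YEAR' out of the list and put it back in second position
cols.insert(1, cols.pop(cols.index('YEAR')))

reordered = demo[cols]
print(reordered.columns.tolist())
```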

In [144]:
type(cols_2[1])

str

In [145]:
cols_2[2]

'COLISIÓN DOBLE                          '

In [146]:
cols_2[2].rstrip()

'COLISIÓN DOBLE'

In [176]:
#Now I change the order of the columns
all_accidents_tot_4 = all_accidents_tot_3[cols_2]
all_accidents_tot_4.head()

Unnamed: 0,DISTRITO_ACCIDENTE,AÑO,COLISIÓN DOBLE,COLISIÓN MÚLTIPLE,CHOQUE CON OBJETO FIJO,ATROPELLO,VUELCO,CAÍDA MOTOCICLETA,CAÍDA CICLOMOTOR,CAÍDA BICICLETA,CAÍDA VIAJERO BUS,OTRAS CAUSAS,CAÍDA VEHÍCULO 3 RUEDAS
0,ARGANZUELA,2009,389,54,126,75,9.0,43,9.0,4.0,5.0,8.0,
1,BARAJAS,2009,89,6,53,21,,12,2.0,1.0,2.0,2.0,
2,CARABANCHEL,2009,375,44,171,137,8.0,36,13.0,6.0,9.0,8.0,
3,CENTRO,2009,514,55,171,143,,61,12.0,3.0,10.0,12.0,
4,CHAMARTIN,2009,494,70,133,92,10.0,72,22.0,2.0,6.0,7.0,


In [177]:
#Now I remove the trailing spaces from the column names
cols_3 = []

for item in cols_2:
  cols_3.append(item.rstrip())

In [178]:
cols_3

['DISTRITO_ACCIDENTE',
 'AÑO',
 'COLISIÓN DOBLE',
 'COLISIÓN MÚLTIPLE',
 'CHOQUE CON OBJETO FIJO',
 'ATROPELLO',
 'VUELCO',
 'CAÍDA MOTOCICLETA',
 'CAÍDA CICLOMOTOR',
 'CAÍDA BICICLETA',
 'CAÍDA VIAJERO BUS',
 'OTRAS CAUSAS',
 'CAÍDA VEHÍCULO 3 RUEDAS']

In [179]:
#Now I replace the column names
all_accidents_tot_4.columns = cols_3
all_accidents_tot_4.head()

Unnamed: 0,DISTRITO_ACCIDENTE,AÑO,COLISIÓN DOBLE,COLISIÓN MÚLTIPLE,CHOQUE CON OBJETO FIJO,ATROPELLO,VUELCO,CAÍDA MOTOCICLETA,CAÍDA CICLOMOTOR,CAÍDA BICICLETA,CAÍDA VIAJERO BUS,OTRAS CAUSAS,CAÍDA VEHÍCULO 3 RUEDAS
0,ARGANZUELA,2009,389,54,126,75,9.0,43,9.0,4.0,5.0,8.0,
1,BARAJAS,2009,89,6,53,21,,12,2.0,1.0,2.0,2.0,
2,CARABANCHEL,2009,375,44,171,137,8.0,36,13.0,6.0,9.0,8.0,
3,CENTRO,2009,514,55,171,143,,61,12.0,3.0,10.0,12.0,
4,CHAMARTIN,2009,494,70,133,92,10.0,72,22.0,2.0,6.0,7.0,


In [180]:
all_accidents_tot_4.columns

Index(['DISTRITO_ACCIDENTE', 'AÑO', 'COLISIÓN DOBLE', 'COLISIÓN MÚLTIPLE',
       'CHOQUE CON OBJETO FIJO', 'ATROPELLO', 'VUELCO', 'CAÍDA MOTOCICLETA',
       'CAÍDA CICLOMOTOR', 'CAÍDA BICICLETA', 'CAÍDA VIAJERO BUS',
       'OTRAS CAUSAS', 'CAÍDA VEHÍCULO 3 RUEDAS'],
      dtype='object')
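
For reference, pandas can do this cleanup in a more compact way. The sketch below, on a one-row made-up DataFrame, uses `rename` twice: once with a function applied to every column name, and once with a dict that maps an old name to a new one. It also strips the values of a text column with the vectorised `.str.rstrip()`.

```python
import pandas as pd

demo = pd.DataFrame({'DISTRITO_ACCIDENTE': ['ARGANZUELA      '],
                     'VUELCO              ': [9.0]})

cleaned = (demo
           .rename(columns=str.rstrip)                            # strip every column name
           .rename(columns={'DISTRITO_ACCIDENTE': 'DISTRITO'}))   # rename one column

# Strip trailing spaces from every value in the column at once
cleaned['DISTRITO'] = cleaned['DISTRITO'].str.rstrip()

print(cleaned.columns.tolist())
print(cleaned)
```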

In [181]:
#Now I rename the "DISTRITO_ACCIDENTE" column
all_accidents_tot_4.head()

Unnamed: 0,DISTRITO_ACCIDENTE,AÑO,COLISIÓN DOBLE,COLISIÓN MÚLTIPLE,CHOQUE CON OBJETO FIJO,ATROPELLO,VUELCO,CAÍDA MOTOCICLETA,CAÍDA CICLOMOTOR,CAÍDA BICICLETA,CAÍDA VIAJERO BUS,OTRAS CAUSAS,CAÍDA VEHÍCULO 3 RUEDAS
0,ARGANZUELA,2009,389,54,126,75,9.0,43,9.0,4.0,5.0,8.0,
1,BARAJAS,2009,89,6,53,21,,12,2.0,1.0,2.0,2.0,
2,CARABANCHEL,2009,375,44,171,137,8.0,36,13.0,6.0,9.0,8.0,
3,CENTRO,2009,514,55,171,143,,61,12.0,3.0,10.0,12.0,
4,CHAMARTIN,2009,494,70,133,92,10.0,72,22.0,2.0,6.0,7.0,


In [182]:
cols_4 = all_accidents_tot_4.columns.tolist()
cols_4

['DISTRITO_ACCIDENTE',
 'AÑO',
 'COLISIÓN DOBLE',
 'COLISIÓN MÚLTIPLE',
 'CHOQUE CON OBJETO FIJO',
 'ATROPELLO',
 'VUELCO',
 'CAÍDA MOTOCICLETA',
 'CAÍDA CICLOMOTOR',
 'CAÍDA BICICLETA',
 'CAÍDA VIAJERO BUS',
 'OTRAS CAUSAS',
 'CAÍDA VEHÍCULO 3 RUEDAS']

In [183]:
cols_4[cols_4.index('DISTRITO_ACCIDENTE')]  = 'DISTRITO'
cols_4

['DISTRITO',
 'AÑO',
 'COLISIÓN DOBLE',
 'COLISIÓN MÚLTIPLE',
 'CHOQUE CON OBJETO FIJO',
 'ATROPELLO',
 'VUELCO',
 'CAÍDA MOTOCICLETA',
 'CAÍDA CICLOMOTOR',
 'CAÍDA BICICLETA',
 'CAÍDA VIAJERO BUS',
 'OTRAS CAUSAS',
 'CAÍDA VEHÍCULO 3 RUEDAS']

In [184]:
all_accidents_tot_4.columns = cols_4
all_accidents_tot_4.head()

Unnamed: 0,DISTRITO,AÑO,COLISIÓN DOBLE,COLISIÓN MÚLTIPLE,CHOQUE CON OBJETO FIJO,ATROPELLO,VUELCO,CAÍDA MOTOCICLETA,CAÍDA CICLOMOTOR,CAÍDA BICICLETA,CAÍDA VIAJERO BUS,OTRAS CAUSAS,CAÍDA VEHÍCULO 3 RUEDAS
0,ARGANZUELA,2009,389,54,126,75,9.0,43,9.0,4.0,5.0,8.0,
1,BARAJAS,2009,89,6,53,21,,12,2.0,1.0,2.0,2.0,
2,CARABANCHEL,2009,375,44,171,137,8.0,36,13.0,6.0,9.0,8.0,
3,CENTRO,2009,514,55,171,143,,61,12.0,3.0,10.0,12.0,
4,CHAMARTIN,2009,494,70,133,92,10.0,72,22.0,2.0,6.0,7.0,
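
The list-based renaming above works, but it replaces every column label at once. As a hedged alternative, `DataFrame.rename` takes a mapping and only touches the labels named in it. The sketch below uses a small stand-in frame (`toy` is a hypothetical name, with two values copied from the output above) rather than the real `all_accidents_tot_4`.

```python
import pandas as pd

# Small stand-in for all_accidents_tot_4; `toy` is a hypothetical name.
toy = pd.DataFrame({
    "DISTRITO_ACCIDENTE": ["ARGANZUELA", "BARAJAS"],
    "ATROPELLO": [75, 21],
})

# rename() returns a new DataFrame; labels not in the mapping are left as they are.
renamed = toy.rename(columns={"DISTRITO_ACCIDENTE": "DISTRITO"})
print(renamed.columns.tolist())
```

Because `rename` returns a new object, the result has to be assigned to a name for the change to persist.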


In [185]:
all_accidents_tot_4.DISTRITO[0]

'ARGANZUELA                    '

In [186]:
all_accidents_tot_4.DISTRITO.size

168

In [188]:
#Now I strip the trailing spaces from the "DISTRITO" column.
#The vectorized .str accessor avoids the row-by-row chained assignment
#(df.col[i] = ...) that makes pandas emit a SettingWithCopyWarning.
all_accidents_tot_4['DISTRITO'] = all_accidents_tot_4['DISTRITO'].str.rstrip()


In [189]:
all_accidents_tot_4.head()

Unnamed: 0,DISTRITO,AÑO,COLISIÓN DOBLE,COLISIÓN MÚLTIPLE,CHOQUE CON OBJETO FIJO,ATROPELLO,VUELCO,CAÍDA MOTOCICLETA,CAÍDA CICLOMOTOR,CAÍDA BICICLETA,CAÍDA VIAJERO BUS,OTRAS CAUSAS,CAÍDA VEHÍCULO 3 RUEDAS
0,ARGANZUELA,2009,389,54,126,75,9.0,43,9.0,4.0,5.0,8.0,
1,BARAJAS,2009,89,6,53,21,,12,2.0,1.0,2.0,2.0,
2,CARABANCHEL,2009,375,44,171,137,8.0,36,13.0,6.0,9.0,8.0,
3,CENTRO,2009,514,55,171,143,,61,12.0,3.0,10.0,12.0,
4,CHAMARTIN,2009,494,70,133,92,10.0,72,22.0,2.0,6.0,7.0,


In [190]:
all_accidents_tot_4.DISTRITO[0]

'ARGANZUELA'

In [192]:
all_accidents_tot_4.DISTRITO[167]

'VILLAVERDE'
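
Spot-checking rows 0 and 167 shows the padding is gone there, but it does not cover the rows in between. A minimal sketch of a whole-column check, using a hypothetical two-value `districts` Series in place of the real `DISTRITO` column:

```python
import pandas as pd

# Hypothetical stand-in for the DISTRITO column: one padded value, one clean value.
districts = pd.Series(["ARGANZUELA                    ", "VILLAVERDE"])

# Count the values that still end in a space, before and after stripping.
padded_before = districts.str.endswith(" ").sum()
padded_after = districts.str.rstrip().str.endswith(" ").sum()
print(padded_before, padded_after)
```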

In [193]:
#Now let's see what this DataFrame contains
all_accidents_tot_4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 168 entries, 0 to 167
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   DISTRITO                 168 non-null    object 
 1   AÑO                      168 non-null    object 
 2   COLISIÓN DOBLE           168 non-null    int64  
 3   COLISIÓN MÚLTIPLE        168 non-null    int64  
 4   CHOQUE CON OBJETO FIJO   168 non-null    int64  
 5   ATROPELLO                168 non-null    int64  
 6   VUELCO                   154 non-null    float64
 7   CAÍDA MOTOCICLETA        168 non-null    int64  
 8   CAÍDA CICLOMOTOR         166 non-null    float64
 9   CAÍDA BICICLETA          162 non-null    float64
 10  CAÍDA VIAJERO BUS        154 non-null    float64
 11  OTRAS CAUSAS             159 non-null    float64
 12  CAÍDA VEHÍCULO 3 RUEDAS  1 non-null      float64
dtypes: float64(6), int64(5), object(2)
memory usage: 17.2+ KB


In [194]:
all_accidents_tot_4.describe()

Unnamed: 0,COLISIÓN DOBLE,COLISIÓN MÚLTIPLE,CHOQUE CON OBJETO FIJO,ATROPELLO,VUELCO,CAÍDA MOTOCICLETA,CAÍDA CICLOMOTOR,CAÍDA BICICLETA,CAÍDA VIAJERO BUS,OTRAS CAUSAS,CAÍDA VEHÍCULO 3 RUEDAS
count,168.0,168.0,168.0,168.0,154.0,168.0,166.0,162.0,154.0,159.0,1.0
mean,295.333333,36.422619,96.720238,73.02381,4.363636,41.797619,8.771084,7.907407,4.448052,3.90566,1.0
std,132.050057,20.784753,41.950345,31.252679,3.275614,27.729782,5.046031,6.855931,2.755961,2.530304,
min,62.0,1.0,28.0,15.0,1.0,5.0,1.0,1.0,1.0,1.0,1.0
25%,177.75,21.0,62.0,53.75,2.0,18.0,5.0,3.0,2.0,2.0,1.0
50%,321.0,35.0,87.0,72.0,3.0,36.0,8.5,6.0,4.0,3.0,1.0
75%,377.0,54.0,128.5,91.0,6.0,60.25,12.0,10.75,6.0,5.0,1.0
max,633.0,82.0,210.0,169.0,18.0,142.0,31.0,37.0,14.0,13.0,1.0
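
The `count` row of `describe()` and the `Non-Null Count` column of `info()` both reveal missing values indirectly. A more direct view, sketched here on a hypothetical `toy` frame built from the first five `VUELCO` and `OTRAS CAUSAS` values shown earlier, is to count the `NaN` entries per column:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in using the first five rows of two columns.
toy = pd.DataFrame({
    "VUELCO": [9.0, np.nan, 8.0, np.nan, 10.0],
    "OTRAS CAUSAS": [8.0, 2.0, 8.0, 12.0, 7.0],
})

# isna() marks missing cells with True; summing counts them per column.
missing = toy.isna().sum()
print(missing)
```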


In [195]:
#Since every NaN here occurs where the value is 0, we replace them with 0 so we can operate on them
all_accidents_tot_4.fillna(0)

Unnamed: 0,DISTRITO,AÑO,COLISIÓN DOBLE,COLISIÓN MÚLTIPLE,CHOQUE CON OBJETO FIJO,ATROPELLO,VUELCO,CAÍDA MOTOCICLETA,CAÍDA CICLOMOTOR,CAÍDA BICICLETA,CAÍDA VIAJERO BUS,OTRAS CAUSAS,CAÍDA VEHÍCULO 3 RUEDAS
0,ARGANZUELA,2009,389,54,126,75,9.0,43,9.0,4.0,5.0,8.0,0.0
1,BARAJAS,2009,89,6,53,21,0.0,12,2.0,1.0,2.0,2.0,0.0
2,CARABANCHEL,2009,375,44,171,137,8.0,36,13.0,6.0,9.0,8.0,0.0
3,CENTRO,2009,514,55,171,143,0.0,61,12.0,3.0,10.0,12.0,0.0
4,CHAMARTIN,2009,494,70,133,92,10.0,72,22.0,2.0,6.0,7.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
163,TETUAN,2016,330,43,70,91,0.0,73,25.0,13.0,5.0,3.0,0.0
164,USERA,2016,206,27,74,75,2.0,16,2.0,5.0,4.0,1.0,0.0
165,VICALVARO,2016,70,4,38,22,4.0,15,2.0,3.0,2.0,1.0,0.0
166,VILLA DE VALLECAS,2016,149,7,42,54,5.0,26,4.0,4.0,1.0,3.0,0.0
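
Note that `fillna` returns a new DataFrame by default and leaves the original untouched, so the filled result must be assigned to a name to be kept. A minimal sketch on a hypothetical `toy` frame:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in with one missing value.
toy = pd.DataFrame({"VUELCO": [9.0, np.nan, 8.0]})

# fillna() returns a new object; `toy` itself still contains the NaN.
filled = toy.fillna(0)

missing_in_original = toy["VUELCO"].isna().sum()
missing_in_filled = filled["VUELCO"].isna().sum()
print(missing_in_original, missing_in_filled)
```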


In [197]:
all_accidents_tot_5 = all_accidents_tot_4.fillna(0)


In [198]:
all_accidents_tot_5.head()

Unnamed: 0,DISTRITO,AÑO,COLISIÓN DOBLE,COLISIÓN MÚLTIPLE,CHOQUE CON OBJETO FIJO,ATROPELLO,VUELCO,CAÍDA MOTOCICLETA,CAÍDA CICLOMOTOR,CAÍDA BICICLETA,CAÍDA VIAJERO BUS,OTRAS CAUSAS,CAÍDA VEHÍCULO 3 RUEDAS
0,ARGANZUELA,2009,389,54,126,75,9.0,43,9.0,4.0,5.0,8.0,0.0
1,BARAJAS,2009,89,6,53,21,0.0,12,2.0,1.0,2.0,2.0,0.0
2,CARABANCHEL,2009,375,44,171,137,8.0,36,13.0,6.0,9.0,8.0,0.0
3,CENTRO,2009,514,55,171,143,0.0,61,12.0,3.0,10.0,12.0,0.0
4,CHAMARTIN,2009,494,70,133,92,10.0,72,22.0,2.0,6.0,7.0,0.0


In [199]:
all_accidents_tot_5.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 168 entries, 0 to 167
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   DISTRITO                 168 non-null    object 
 1   AÑO                      168 non-null    object 
 2   COLISIÓN DOBLE           168 non-null    int64  
 3   COLISIÓN MÚLTIPLE        168 non-null    int64  
 4   CHOQUE CON OBJETO FIJO   168 non-null    int64  
 5   ATROPELLO                168 non-null    int64  
 6   VUELCO                   168 non-null    float64
 7   CAÍDA MOTOCICLETA        168 non-null    int64  
 8   CAÍDA CICLOMOTOR         168 non-null    float64
 9   CAÍDA BICICLETA          168 non-null    float64
 10  CAÍDA VIAJERO BUS        168 non-null    float64
 11  OTRAS CAUSAS             168 non-null    float64
 12  CAÍDA VEHÍCULO 3 RUEDAS  168 non-null    float64
dtypes: float64(6), int64(5), object(2)
memory usage: 17.2+ KB


In [206]:
all_accidents_tot_5.VUELCO = all_accidents_tot_5.VUELCO.astype('int64')


In [207]:
all_accidents_tot_5.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 168 entries, 0 to 167
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   DISTRITO                 168 non-null    object 
 1   AÑO                      168 non-null    object 
 2   COLISIÓN DOBLE           168 non-null    int64  
 3   COLISIÓN MÚLTIPLE        168 non-null    int64  
 4   CHOQUE CON OBJETO FIJO   168 non-null    int64  
 5   ATROPELLO                168 non-null    int64  
 6   VUELCO                   168 non-null    int64  
 7   CAÍDA MOTOCICLETA        168 non-null    int64  
 8   CAÍDA CICLOMOTOR         168 non-null    float64
 9   CAÍDA BICICLETA          168 non-null    float64
 10  CAÍDA VIAJERO BUS        168 non-null    float64
 11  OTRAS CAUSAS             168 non-null    float64
 12  CAÍDA VEHÍCULO 3 RUEDAS  168 non-null    float64
dtypes: float64(5), int64(6), object(2)
memory usage: 17.2+ KB


In [211]:
all_accidents_tot_5['CAÍDA CICLOMOTOR'] = all_accidents_tot_5['CAÍDA CICLOMOTOR'].astype('int64')
all_accidents_tot_5['CAÍDA BICICLETA'] = all_accidents_tot_5['CAÍDA BICICLETA'].astype('int64')
all_accidents_tot_5['CAÍDA VIAJERO BUS'] = all_accidents_tot_5['CAÍDA VIAJERO BUS'].astype('int64')
all_accidents_tot_5['OTRAS CAUSAS'] = all_accidents_tot_5['OTRAS CAUSAS'].astype('int64')
all_accidents_tot_5['CAÍDA VEHÍCULO 3 RUEDAS'] = all_accidents_tot_5['CAÍDA VEHÍCULO 3 RUEDAS'].astype('int64')

In [212]:
all_accidents_tot_5.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 168 entries, 0 to 167
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   DISTRITO                 168 non-null    object
 1   AÑO                      168 non-null    object
 2   COLISIÓN DOBLE           168 non-null    int64 
 3   COLISIÓN MÚLTIPLE        168 non-null    int64 
 4   CHOQUE CON OBJETO FIJO   168 non-null    int64 
 5   ATROPELLO                168 non-null    int64 
 6   VUELCO                   168 non-null    int64 
 7   CAÍDA MOTOCICLETA        168 non-null    int64 
 8   CAÍDA CICLOMOTOR         168 non-null    int64 
 9   CAÍDA BICICLETA          168 non-null    int64 
 10  CAÍDA VIAJERO BUS        168 non-null    int64 
 11  OTRAS CAUSAS             168 non-null    int64 
 12  CAÍDA VEHÍCULO 3 RUEDAS  168 non-null    int64 
dtypes: int64(11), object(2)
memory usage: 17.2+ KB
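
The column-by-column conversions above can also be written as a single call, since `DataFrame.astype` accepts a dictionary mapping column names to dtypes. The sketch below assumes a hypothetical `toy` frame and picks the float columns with `select_dtypes`. Converting to `int64` raises an error if any `NaN` is still present, which is why the `fillna(0)` step has to come first.

```python
import pandas as pd

# Hypothetical stand-in: two float count columns plus non-numeric columns.
toy = pd.DataFrame({
    "DISTRITO": ["ARGANZUELA", "BARAJAS"],
    "AÑO": ["2009", "2009"],
    "VUELCO": [9.0, 0.0],
    "OTRAS CAUSAS": [8.0, 2.0],
})

# Collect every float64 column, then convert them all in one astype() call.
float_cols = toy.select_dtypes(include="float64").columns
converted = toy.astype({col: "int64" for col in float_cols})
print(converted.dtypes)
```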


In [214]:
all_accidents_tot_5['AÑO'] = all_accidents_tot_5['AÑO'].astype('int64')

In [218]:
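#We may be interested in the total number of accidents per year.
#Note: value_counts() only counts how often each distinct row appears; it does not sum anything.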
all_accidents_tot_5.value_counts()

DISTRITO             AÑO   COLISIÓN DOBLE  COLISIÓN MÚLTIPLE  CHOQUE CON OBJETO FIJO  ATROPELLO  VUELCO  CAÍDA MOTOCICLETA  CAÍDA CICLOMOTOR  CAÍDA BICICLETA  CAÍDA VIAJERO BUS  OTRAS CAUSAS  CAÍDA VEHÍCULO 3 RUEDAS
VILLAVERDE           2016  177             11                 71                      54         1       8                  5                 6                2                  1             0                          1
CIUDAD LINEAL        2013  396             52                 128                     75         3       72                 10                6                4                  4             0                          1
FUENCARRAL-EL PARDO  2013  335             42                 159                     72         9       39                 7                 13               2                  4             0                          1
                     2012  317             49                 141                     78         18      36              

In [222]:
#I realize that not having a two-level index is also a problem, because everything related to statistics gets messy... Although not that much...
all_accidents_tot_5.groupby(['AÑO']).sum()

AÑO,COLISIÓN DOBLE,COLISIÓN MÚLTIPLE,CHOQUE CON OBJETO FIJO,ATROPELLO,VUELCO,CAÍDA MOTOCICLETA,CAÍDA CICLOMOTOR,CAÍDA BICICLETA,CAÍDA VIAJERO BUS,OTRAS CAUSAS,CAÍDA VEHÍCULO 3 RUEDAS
2009,6498,790,2390,1582,98,768,231,76,93,134,0
2010,6118,745,2036,1570,110,691,188,81,105,84,0
2011,6238,776,2108,1566,76,803,185,114,75,62,0
2012,6170,784,1974,1496,92,774,187,126,69,74,0
2013,6157,723,1975,1486,67,927,137,140,76,61,0
2014,6056,769,1997,1483,94,987,150,232,89,78,0
2015,6121,736,1804,1525,60,983,166,239,85,63,0
2016,6258,796,1965,1560,75,1089,212,273,93,65,1
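
In the output above the text column `DISTRITO` was dropped from the sums. Depending on the pandas version, `groupby(...).sum()` may instead try to aggregate string columns as well (concatenating the strings), so it can be safer to ask for numeric columns explicitly. A minimal sketch on a hypothetical `toy` frame with a few values taken from the tables above:

```python
import pandas as pd

# Hypothetical stand-in: three rows spanning two years.
toy = pd.DataFrame({
    "DISTRITO": ["ARGANZUELA", "BARAJAS", "ARGANZUELA"],
    "AÑO": [2009, 2009, 2010],
    "ATROPELLO": [75, 21, 75],
    "VUELCO": [9, 0, 7],
})

# numeric_only=True keeps the text column out of the aggregation.
per_year = toy.groupby("AÑO").sum(numeric_only=True)
print(per_year)
```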


In [223]:
all_accidents_tot_5_year = all_accidents_tot_5.groupby(['AÑO']).sum()

In [224]:
type(all_accidents_tot_5_year)

pandas.core.frame.DataFrame

In [225]:
all_accidents_tot_5_year.index

Int64Index([2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016], dtype='int64', name='AÑO')

In [226]:
all_accidents_tot_5_year.describe()

Unnamed: 0,COLISIÓN DOBLE,COLISIÓN MÚLTIPLE,CHOQUE CON OBJETO FIJO,ATROPELLO,VUELCO,CAÍDA MOTOCICLETA,CAÍDA CICLOMOTOR,CAÍDA BICICLETA,CAÍDA VIAJERO BUS,OTRAS CAUSAS,CAÍDA VEHÍCULO 3 RUEDAS
count,8.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0,8.0
mean,6202.0,764.875,2031.125,1533.5,84.0,877.75,182.0,160.125,85.625,77.625,0.125
std,136.292752,26.962606,168.294332,40.931999,17.096366,138.015268,30.882265,76.683836,11.819324,24.277782,0.353553
min,6056.0,723.0,1804.0,1483.0,60.0,691.0,137.0,76.0,69.0,61.0,0.0
25%,6120.25,742.75,1971.75,1493.5,73.0,772.5,162.0,105.75,75.75,62.75,0.0
50%,6163.5,772.5,1986.0,1542.5,84.0,865.0,186.0,133.0,87.0,69.5,0.0
75%,6243.0,785.5,2054.0,1567.0,95.0,984.0,194.0,233.75,93.0,79.5,0.0
max,6498.0,796.0,2390.0,1582.0,110.0,1089.0,231.0,273.0,105.0,134.0,1.0


In [227]:
all_accidents_tot_5_year.std()/all_accidents_tot_5_year.mean()

COLISIÓN DOBLE             0.021976
COLISIÓN MÚLTIPLE          0.035251
CHOQUE CON OBJETO FIJO     0.082858
ATROPELLO                  0.026692
VUELCO                     0.203528
CAÍDA MOTOCICLETA          0.157238
CAÍDA CICLOMOTOR           0.169683
CAÍDA BICICLETA            0.478900
CAÍDA VIAJERO BUS          0.138036
OTRAS CAUSAS               0.312757
CAÍDA VEHÍCULO 3 RUEDAS    2.828427
dtype: float64

In [228]:
#Careful with the result for "CAÍDA VEHÍCULO 3 RUEDAS": there is only one non-zero value, so the mean, the std and so on are not meaningful.
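
The ratio `std() / mean()` computed above is the coefficient of variation. As a small sketch of why the three-wheeler column stands out, the snippet below rebuilds its eight yearly totals (seven zeros and a single 1, as in the per-year table) in a hypothetical `yearly` Series. Keep in mind that pandas' `std()` uses the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Yearly totals for the three-wheeler column, 2009 to 2016: seven zeros and one 1.
yearly = pd.Series([0, 0, 0, 0, 0, 0, 0, 1], name="CAÍDA VEHÍCULO 3 RUEDAS")

# Coefficient of variation: sample standard deviation divided by the mean.
cv = yearly.std() / yearly.mean()
print(round(cv, 6))
```

With a mean of 0.125 driven by a single event, the ratio is large but says little about real variability.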

In [247]:
#We may be interested in the sum of all accidents for one district in 2009
all_accidents_tot_5.iloc[0]


DISTRITO                   ARGANZUELA
AÑO                              2009
COLISIÓN DOBLE                    389
COLISIÓN MÚLTIPLE                  54
CHOQUE CON OBJETO FIJO            126
ATROPELLO                          75
VUELCO                              9
CAÍDA MOTOCICLETA                  43
CAÍDA CICLOMOTOR                    9
CAÍDA BICICLETA                     4
CAÍDA VIAJERO BUS                   5
OTRAS CAUSAS                        8
CAÍDA VEHÍCULO 3 RUEDAS             0
Name: 0, dtype: object

In [248]:
#CAREFUL WITH THE FIELDS WE DO NOT WANT TO SUM (DISTRITO AND AÑO)
all_accidents_tot_5.iloc[0][2:].sum()

722
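
The same idea extends from a single row to every row at once: select the accident-type columns and sum across them with `axis=1`. This is a sketch on a hypothetical `toy` frame with two accident columns only, so the totals are smaller than the 722 obtained above from all eleven columns.

```python
import pandas as pd

# Hypothetical stand-in: two rows, two accident-type columns.
toy = pd.DataFrame({
    "DISTRITO": ["ARGANZUELA", "BARAJAS"],
    "AÑO": [2009, 2009],
    "COLISIÓN DOBLE": [389, 89],
    "ATROPELLO": [75, 21],
})

# Skip the first two columns (DISTRITO, AÑO) and sum the rest row by row.
accident_cols = toy.columns[2:]
row_totals = toy[accident_cols].sum(axis=1)
print(row_totals.tolist())
```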

In [262]:
#We may be interested in the sum of all accidents for one district over all years
all_accidents_tot_5[all_accidents_tot_5.DISTRITO == 'ARGANZUELA']

Unnamed: 0,DISTRITO,AÑO,COLISIÓN DOBLE,COLISIÓN MÚLTIPLE,CHOQUE CON OBJETO FIJO,ATROPELLO,VUELCO,CAÍDA MOTOCICLETA,CAÍDA CICLOMOTOR,CAÍDA BICICLETA,CAÍDA VIAJERO BUS,OTRAS CAUSAS,CAÍDA VEHÍCULO 3 RUEDAS
0,ARGANZUELA,2009,389,54,126,75,9,43,9,4,5,8,0
21,ARGANZUELA,2010,403,48,103,75,7,26,15,2,5,6,0
42,ARGANZUELA,2011,373,48,102,61,3,45,14,5,3,0,0
63,ARGANZUELA,2012,376,49,77,66,5,42,10,2,3,1,0
84,ARGANZUELA,2014,335,54,78,59,0,60,4,15,7,4,0
105,ARGANZUELA,2015,340,60,66,58,2,39,6,15,7,7,0
126,ARGANZUELA,2013,360,50,79,63,2,54,6,7,6,4,0
147,ARGANZUELA,2016,328,62,86,57,1,46,12,10,6,2,0


In [None]:
#To do that, I create a variable holding the list of columns I want

In [278]:
accident_types = all_accidents_tot_5.columns[2:]
accident_types

Index(['COLISIÓN DOBLE', 'COLISIÓN MÚLTIPLE', 'CHOQUE CON OBJETO FIJO',
       'ATROPELLO', 'VUELCO', 'CAÍDA MOTOCICLETA', 'CAÍDA CICLOMOTOR',
       'CAÍDA BICICLETA', 'CAÍDA VIAJERO BUS', 'OTRAS CAUSAS',
       'CAÍDA VEHÍCULO 3 RUEDAS'],
      dtype='object')

In [279]:
all_accidents_tot_5[all_accidents_tot_5.DISTRITO == 'ARGANZUELA'][accident_types].sum()

COLISIÓN DOBLE             2904
COLISIÓN MÚLTIPLE           425
CHOQUE CON OBJETO FIJO      717
ATROPELLO                   514
VUELCO                       29
CAÍDA MOTOCICLETA           355
CAÍDA CICLOMOTOR             76
CAÍDA BICICLETA              60
CAÍDA VIAJERO BUS            42
OTRAS CAUSAS                 32
CAÍDA VEHÍCULO 3 RUEDAS       0
dtype: int64
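
Filtering on one district name gives the totals for that district only. To get the same totals for every district in one step, one option is to group by `DISTRITO` and sum the selected columns. A minimal sketch, assuming a hypothetical `toy` frame with values taken from the tables above:

```python
import pandas as pd

# Hypothetical stand-in: two districts, one of them present in two years.
toy = pd.DataFrame({
    "DISTRITO": ["ARGANZUELA", "BARAJAS", "ARGANZUELA"],
    "AÑO": [2009, 2009, 2010],
    "COLISIÓN DOBLE": [389, 89, 403],
    "ATROPELLO": [75, 21, 75],
})

# Select only the accident-type columns so AÑO is not summed by mistake.
accident_cols = ["COLISIÓN DOBLE", "ATROPELLO"]
per_district = toy.groupby("DISTRITO")[accident_cols].sum()
print(per_district)
```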

## Reading from a SQL database

Finally, let's read from the SQL database we created before:

In [None]:
df3 = pd.read_sql_query("SELECT * from traffic", conn)

In [None]:
df3.head()
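
The cell above depends on a `conn` object and a `traffic` table created earlier in the notebook. For readers who want to try `pd.read_sql_query` in isolation, here is a self-contained sketch. It assumes an in-memory SQLite database (the actual database used earlier may differ) and a hypothetical two-row table written with `to_sql`.

```python
import sqlite3

import pandas as pd

# Assumption: an in-memory SQLite database stands in for the notebook's database.
conn = sqlite3.connect(":memory:")

# Hypothetical two-row table named like the one queried above.
toy = pd.DataFrame({"DISTRITO": ["ARGANZUELA", "BARAJAS"], "ATROPELLO": [75, 21]})
toy.to_sql("traffic", conn, index=False)

# Run the query and get the result back as a DataFrame.
df3 = pd.read_sql_query("SELECT * FROM traffic", conn)
conn.close()
print(df3)
```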

# Additional References

[Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do)

[What is SciPy?](https://www.scipy.org/)

[How can SciPy be fast if it is written in an interpreted language like Python?](https://www.scipy.org/scipylib/faq.html#how-can-scipy-be-fast-if-it-is-written-in-an-interpreted-language-like-python)

[What is the difference between NumPy and SciPy?](https://www.scipy.org/scipylib/faq.html#what-is-the-difference-between-numpy-and-scipy)

[Linear Algebra for AI](https://github.com/fastai/fastai/blob/master/tutorials/linalg_pytorch.ipynb)