# Session 3 Recap

*Joachim Kahr Rasmussen*

## Recap (I/II) 

We can think of there as being two 'types' of plots:
- **Exploratory** plots: Figures for understanding data
    - Quick to produce $\sim$ minimal polishing
    - Interesting features may be implied by the producer
    - Be careful showing these out of context
- **Explanatory** plots: Figures to convey a message
    - Polished figures
    - Direct attention to interesting feature in the data
    - Minimize risk of misunderstanding

There exist several packages for plotting. 

Some popular ones:
- `Matplotlib` is good for customization (explanatory plots)
    - Might take a lot of time when customizing!
- `Seaborn` and `Pandas` are good for quick and dirty plots (exploratory)

## Recap (II/II) 

We need to put a lot of thinking in how to present data.

In particular, one must consider the *type* of data that is to be presented:

- One variable:
    - Categorical: Pie charts, simple counts, etc.
    - Numeric: Histograms, `distplot` (optionally cumulative) in seaborn (`histplot`/`displot` in newer versions)


- Multiple variables:
    - `scatter` (matplotlib) or `jointplot` (seaborn) for (i) simple descriptives when (ii) both variables are numeric and (iii) there are not too many observations
    - `lmplot` (seaborn) when you also want to fit a linear model
    - `bar` (matplotlib) or `barplot`, `catplot` and `violinplot` (all seaborn) when one or more variables are categorical
    - the option `hue` allows you to add a "third" categorical dimension... use with care
    - Lots of other plot types and options. Go explore yourself!

- When you just want to explore: `pairplot` (seaborn) plots all pairwise relationships between variables

## Questions?

<center><img src='https://media.giphy.com/media/moX4lN4FYlre4eyuVd/giphy.gif' alt="Drawing" style="width: 400px;"/></center>

# Session 4: Data structuring I

### The Pandas way

*Joachim Kahr Rasmussen*

# Small groups

If there are only 1 or 2 people in your group, come see me in the break.


## Overview of Session

Today, we will work with `pandas` and how to structure your data. In particular, we will cover:
1. Why we structure data
2. Overview of Numpy and Pandas
3. The Pandas Series
    - Working with series and numeric procedures
    - The special case of boolean series
4. More tools:
    - Inspecting and selecting observations
    - Modifying dataframes
    - Dataframe IO: Loading and storing

## Associated Readings

PDA, chapter 7:
- Handling missing data
- Data transformations: 
    - Duplicates 
    - Mapping
    - Replacing
    - Renaming axes
    - Binning
    - Filtering outliers
    - Dummies
- String manipulations

PDA, sections 11.1-11.2:
- Dates and time in Python
- Working with time series in pandas (time as index)

PDA, sections 12.1, 12.3:
- Working with categorical data in pandas
- Method chaining

PML, chapter 4, section 'Handling categorical data':
- Encoding class labels with `LabelEncoder`
- One-hot encoding

## Loading Stuff

In [1]:
# Loading packages
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
import seaborn as sns

# Plotting style
plt.style.use('ggplot')
%matplotlib inline

# Adjusting plotting defaults
SMALL_SIZE = 16
MEDIUM_SIZE = 18
BIGGER_SIZE = 20

plt.rc('font', size=SMALL_SIZE)          # controls default text sizes
plt.rc('axes', titlesize=SMALL_SIZE)     # fontsize of the axes title
plt.rc('axes', labelsize=MEDIUM_SIZE)    # fontsize of the x and y labels
plt.rc('xtick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
plt.rc('ytick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
plt.rc('legend', fontsize=SMALL_SIZE)    # legend fontsize
plt.rc('figure', titlesize=BIGGER_SIZE)  # fontsize of the figure title

# Why We Structure Data

## Motivation
*Why do we want to learn data structuring?*

- Data rarely comes in the form of our model. We need to 'wrangle' our data.

*Can our machine learning models not do this for us?* 

- Not yet :). The current version needs **tidy** data. What is tidy? 

One row per observation.
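
As a minimal sketch of what "one row per observation" means in practice, the block below reshapes a small "wide" table into tidy form with `melt`. The table, its column names and its numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical "wide" table: one row per country, one column per year
wide = pd.DataFrame({
    'country': ['DK', 'SE'],
    '2019': [1, 2],
    '2020': [3, 4],
})

# Reshape to tidy ("long") form: one row per (country, year) observation
tidy = wide.melt(id_vars='country', var_name='year', value_name='value')
print(tidy)
```

After reshaping, each of the four country-year combinations has its own row.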

<center><img src='https://raw.githubusercontent.com/abjer/sds2017/master/slides/figures/tidy.png'></center>

# Numpy and Pandas

## Numpy Overview
*What is the [`numpy`](http://www.numpy.org/) module?*

`numpy` is a Python module similar to MATLAB:
- fast and versatile for manipulating arrays
- linear algebra tools available
- used in some machine learning and statistics packages

Example from earlier sessions

In [2]:
table = [[1,2],[3,4]]
arr = np.array(table)
arr

array([[1, 2],
       [3, 4]])

## Pandas motivation
*Why use Pandas?*

It is built on numpy:
- Simplicity: Pandas is built with Python's simplicity 
- Powerful and fast tools for manipulating data from numpy

Improves on numpy:
- Clarity, flexibility by using labels (keys)
- Introduces lots of new, useful tools for data analysis (more on this)

The future: interesting development combining tools for big and small data

Note: Much more similar to common software for data manipulation like, say, Stata

## Pandas popularity

<center><img src='https://www.sqlshack.com/wp-content/uploads/2020/08/pandas-in-python-popularity-from-stack-overflow.png' alt="Drawing" style="width: 800px;"/></center>



## Pandas Data Types
*How do we work with data in Pandas?*

- We use two fundamental data structures: 
  - ``Series``;
  - ``DataFrame``.

## Pandas Data Frames (I/II)

*What is a `DataFrame`?*

- A 2d-array (matrix) with labelled columns and rows (which are called indices). Examples:

In [3]:
df = pd.DataFrame(data=arr,
                  columns=['A', 'B'])

df2 = pd.DataFrame(data=arr,
                  index=['i', 'ii'],
                  columns=['A', 'B'])

print(df, '\n\n', df2)

   A  B
0  1  2
1  3  4 

     A  B
i   1  2
ii  3  4


- An object with powerful data tools.

## Pandas Data Frames (II/II)

*How are pandas dataframes built?*

Pandas dataframes can be thought of as numpy arrays with some additional stuff.
- Note that columns can have different datatypes!

Most functions from `numpy` can be applied directly to Pandas objects. We can convert a DataFrame to a `numpy` array with the `values` attribute.

In [4]:
df.values.tolist()

[[1, 2], [3, 4]]

*To note*: In Python we can describe it as a *list of lists* or a *dict of dicts*.
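
To illustrate the two points above (columns with different datatypes, and `numpy` functions applied directly), here is a small sketch. The frame `mixed` and its contents are hypothetical; `to_numpy` is used as the method counterpart of the `values` attribute.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one integer, one float and one string column
mixed = pd.DataFrame({'n': [1, 2, 3],
                      'x': [0.5, 1.5, 2.5],
                      'label': ['a', 'b', 'c']})

print(mixed.dtypes)                 # one dtype per column

# numpy functions apply directly to the numeric columns
roots = np.sqrt(mixed[['n', 'x']])
print(roots)

# Converting the numeric part gives a plain numpy array
as_array = mixed[['n', 'x']].to_numpy()
print(type(as_array), as_array.shape)
```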

## Pandas Series
*What is a `Series`?*

- A vector/list with labels for each entry. Examples:

In [5]:
L = [1, 1.2, 'abc', True]

my_series = pd.Series(L)
my_series

0       1
1     1.2
2     abc
3    True
dtype: object

In [6]:
my_series.to_dict()

{0: 1, 1: 1.2, 2: 'abc', 3: True}

*What data structure does this remind us of?*



- A mix of Python list and dictionary (more info follows)

## Series vs DataFrames
*How are Series related to DataFrames?*

Every column is a series. Example, access as key (recommended):

In [7]:
print(df['B'])

0    2
1    4
Name: B, dtype: int32


Another option is to access the column as an attribute... smart, but dangerous!

To illustrate, add one more column:

In [8]:
df['count'] =  5
print(df)

   A  B  count
0  1  2      5
1  3  4      5


Print column 'B':

In [9]:
print(df.B)

0    2
1    4
Name: B, dtype: int32


Went well... Now, print column 'count':

In [10]:
print(df.count)

<bound method DataFrame.count of    A  B  count
0  1  2      5
1  3  4      5>



Clearly: The first option is more robust, as columns named the same as methods, e.g. `count`, cannot be accessed as attributes.
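
A short sketch of the safe alternative: key access returns the column even when its name collides with a method. The frame `df_demo` is a stand-in for the one used above.

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 3], 'B': [2, 4], 'count': [5, 5]})

col = df_demo['count']          # key access returns the column
print(col)

print(callable(df_demo.count))  # attribute access finds the method instead
print(df_demo.count())          # ...which counts non-missing values per column
```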

## Indices and Column Names
*Why don't we just use numpy arrays and matrices?*


- Inspection of data is quicker
    - What was it that column 18 represented?

- Keep track of rows after deletion
    - Again.... What was it that column 18 represented!?

- Indices may contain fundamentally different data structures 
    - e.g. time series (more about this later)
    - Other datatypes (spatial data $\rightarrow$ advanced course)

- Facilitates complex operation (next session):
    - Merging datasets
    - Split-apply-combine (operations on subsets of data)
    - Method chaining (multiple operations in sequence)
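
A minimal sketch of the "keep track of rows after deletion" point, using a made-up series with hypothetical labels: after a row is dropped, the label still identifies the same observation, while the integer position has moved.

```python
import pandas as pd

# Hypothetical labelled data
scores = pd.Series([10, 20, 30, 40], index=['anna', 'bo', 'carl', 'dina'])

remaining = scores.drop('bo')       # delete one row
print(remaining)

print(remaining.loc['carl'])        # the label still points at the same observation
print(remaining.iloc[1])            # 'carl' moved from position 2 to position 1
```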

# Working with Pandas Series

## Generating a Series (I/IV)
Let's revisit our series

In [11]:
my_series

0       1
1     1.2
2     abc
3    True
dtype: object

Components in series:

- `index`: label for each observation

- `values`: observation data

- `dtype`: the format of the series (`object` means any data type is allowed)
  - examples are fundamental datatypes (`float`, `int`, `bool`)  
      - in terms of precision: `float`>`int`>`bool`
      - this comes at a cost in the form of speed
  - note: the object `dtype` is SLOW!
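
The `dtype` is inferred from the data when a series is created. The sketch below, with made-up values, shows how the content decides which datatype pandas picks.

```python
import pandas as pd

s_int = pd.Series([1, 2, 3])
s_float = pd.Series([1, 2.5, 3])        # one float upcasts the whole series
s_bool = pd.Series([True, False, True])
s_mixed = pd.Series([1, 'abc', True])   # mixed types fall back to object

for name, s in [('int', s_int), ('float', s_float),
                ('bool', s_bool), ('mixed', s_mixed)]:
    print(name, s.dtype)
```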

## Generating a Series (II/IV)
*How do we set a custom index?* 

Example:

In [12]:
num_data = range(0,3) # Generate data
indices = ['B', 'C', 'A'] # Generate index names
my_series2 = pd.Series(data=num_data, index=indices) # Create a pandas series from the two
my_series2

B    0
C    1
A    2
dtype: int64

## Generating a Series (III/IV)
*Can a dictionary be converted to a series?*

Yes, we just pass it to the Series class constructor. Example:

In [13]:
d = {'yesterday': 0, 'today': 1, 'tomorrow':3} # Create some dictionary
my_series3 = pd.Series(d) # Use the constructor
my_series3

yesterday    0
today        1
tomorrow     3
dtype: int64

Note: The same is true for DataFrames, where each value in the dictionary can itself be a dictionary (one per column). Example:

In [14]:
djan = {'1st': 0, '2nd': 1, '3rd':3} # Create some dictionary for january
dfeb = {'1st': -3, '2nd': -1, '3rd':-2} # Create some dictionary for february
dmar = {'1st': 3, '2nd': 5, '3rd':4} # Create some dictionary for march

d = {'january': djan, 'february': dfeb, 'march': dmar} # Create dictionary of dictionaries
my_df1 = pd.DataFrame(d) # Use the constructor
my_df1

     january  february  march
1st        0        -3      3
2nd        1        -1      5
3rd        3        -2      4


What happens if keys are not the same? No big deal...

In [15]:
djan = {'1st': 0, '2nd': 1, '3rd':3} # Create some dictionary for january
dfeb = {'1st': -3, '2nd': -1, '3rd':-2} # Create some dictionary for february
dmar = {'1st': 3, '2nd': 5, '4th':4} # Create some dictionary for march

d = {'january': djan, 'february': dfeb, 'march': dmar} # Create dictionary of dictionaries
my_df2 = pd.DataFrame(d) # Use the constructor
my_df2

     january  february  march
1st      0.0      -3.0    3.0
2nd      1.0      -1.0    5.0
3rd      3.0      -2.0    NaN
4th      NaN       NaN    4.0


## Generating a Series (IV/IV)
*Can we convert series to dictionaries?*

- Yes, in most cases. 

In [16]:
my_series3.to_dict()

{'yesterday': 0, 'today': 1, 'tomorrow': 3}

- **<font color="red">WARNING!</font>**: Series indices are NOT necessarily unique! Example:

In [100]:
s = pd.Series(range(3), index=['A','A', 'A']) # Create series with same indices
print(s) # Print series
print()
print(s.index.duplicated()) # Check duplicates
print()
print(s.to_dict()) # So translating to a dict gives...

A    0
A    1
A    2
dtype: int64

[False  True  True]

{'A': 2}
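
As a hedged sketch of how one might guard against this, the block below checks for uniqueness with `index.is_unique` and, as one possible workaround (not the only one), collects the values per label into lists before converting to a dict. The series `s_dup` is made up.

```python
import pandas as pd

s_dup = pd.Series(range(3), index=['A', 'A', 'B'])

print(s_dup.index.is_unique)     # False: 'A' appears twice
print(s_dup.loc['A'])            # selecting 'A' returns a Series with both rows

# One way to make a safe dict: collect the values per label into lists
grouped = s_dup.groupby(level=0).agg(list).to_dict()
print(grouped)
```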


## The Power of Pandas
*How is the series different from a dict?*

- We will see that pandas Series have powerful methods and operations.
- It is both key and index based (i.e. sequential).
    - Remember that unlike, say, lists, dictionaries cannot be accessed by position!

## Converting Data Types (I/II)

The data type of a series can be converted with the **astype** method. Example with a float:

In [98]:
print(my_series3.astype(float))
print()
print(my_series3.astype(float)*2)

yesterday    0.0
today        1.0
tomorrow     3.0
dtype: float64

yesterday    0.0
today        2.0
tomorrow     6.0
dtype: float64


## Converting Data Types (II/II)

Example with a string:

In [28]:
print(my_series3.astype(str))  # np.str was removed from recent numpy versions; the builtin str works
print()
print(my_series3.astype(str)*2)

yesterday    0
today        1
tomorrow     3
dtype: object


yesterday    00
today        11
tomorrow     33
dtype: object


# Numeric Procedures

## Numeric Operations (I/III)
*More generally, how can we make basic arithmetic operations with arrays, series and dataframes?*

It works just like with Python numbers, only applied elementwise! An example with squaring:

In [33]:
my_arr1 = np.array([2,3,2,1,1])
my_arr2 = my_arr1 ** 2

print(my_arr1)
print(my_arr2)

[2 3 2 1 1]
[4 9 4 1 1]


## Numeric Operations (II/III)
*Are other numeric python operators the same??*

Numeric operators (`+`, `-`, `*`, `/`, `//`, `**`) work as expected.

So do comparison operators (`==`, `!=`, `>`, `<`).

*Why is this useful?*

- vectorized operations are VERY fast;
- requires very little code.

## Numeric Operations (III/III)
*Can we do the same with two vectors?*

- Yes, we can also do elementwise addition, multiplication, subtractions etc. of series. Example: 

In [34]:
my_arr1 + my_arr2

array([ 6, 12,  6,  2,  2])
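
The example above uses numpy arrays, which are combined purely by position. For pandas Series, elementwise operations are matched on index labels instead. A small sketch with made-up series:

```python
import pandas as pd

s_a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
s_b = pd.Series([10, 20, 30], index=['z', 'y', 'w'])   # different order and labels

total = s_a + s_b       # matched on labels, not on position
print(total)
```

Labels that appear in only one of the two series (`x` and `w` here) give missing values in the result.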

## Numeric methods (I/III)

Pandas series have powerful numeric methods built-in. Example of 10 mil. obs:

In [55]:
arr_rand = np.random.randn(10**7) # Draw 10^7 observations from a standard normal; same as np.random.normal(size=10**7)
s2 = pd.Series(arr_rand) # Convert to pandas series
s2.median() # Display median


-4.558314137490764e-05

Other useful methods include: `mean`, `quantile`, `min`, `max`, `std`, `describe` and many more.

In [56]:
np.round(s2.describe(),2) # Display other characteristics of distribution (rounded)

count    10000000.00
mean            0.00
std             1.00
min            -5.53
25%            -0.67
50%            -0.00
75%             0.67
max             5.19
dtype: float64

## Numeric methods (II/III)
An important method is `value_counts`. This counts the number of occurrences of each distinct value. 

Example:

In [77]:
cuts = np.arange(-7, 8, 1) # range from -7 to 7 with intervals of unit size
cats = pd.cut(s2, cuts) # cut into categorical data

In [78]:
cats.unique()

[(-1, 0], (0, 1], (-2, -1], (1, 2], (2, 3], ..., (3, 4], (-5, -4], (4, 5], (5, 6], (-6, -5]]
Length: 12
Categories (12, interval[int64]): [(-6, -5] < (-5, -4] < (-4, -3] < (-3, -2] ... (2, 3] < (3, 4] < (4, 5] < (5, 6]]

In [81]:
cats.value_counts()

(-1, 0]     3413632
(0, 1]      3412743
(1, 2]      1359912
(-2, -1]    1358592
(-3, -2]     214482
(2, 3]       213647
(3, 4]        13192
(-4, -3]      13181
(-5, -4]        308
(4, 5]          304
(5, 6]            4
(-6, -5]          3
(6, 7]            0
(-7, -6]          0
dtype: int64

Where do the observed values end up in the `value_counts` output: in the index or in the data?
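
A small sketch, on made-up categorical data, that makes the structure of the `value_counts` output explicit:

```python
import pandas as pd

# Hypothetical data
colors = pd.Series(['red', 'blue', 'red', 'green', 'red', 'blue'])
counts = colors.value_counts()

print(counts)
print(list(counts.index))    # the distinct values end up in the index
print(list(counts.values))   # the counts are the data
```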

## Numeric methods (III/III)
*Are there other powerful numeric methods?*

Yes: examples include 
- `unique`, `nunique`: the unique elements and the count of unique elements
- `cut`, `qcut`: partition series into bins 
- `diff`: difference between consecutive observations
- `cumsum`: cumulative sum
- `nlargest`, `nsmallest`: the n largest/smallest elements 
- `idxmin`, `idxmax`: index label of the minimal/maximal value 
- `corr`: correlation with another series (for DataFrames: the correlation matrix)

Check [series documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html) for more information.
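
A few of the methods listed above, tried on a small made-up series so the results are easy to check by hand:

```python
import pandas as pd

s_num = pd.Series([3, 1, 4, 1, 5], index=list('abcde'))

print(s_num.nunique())          # number of distinct values
print(s_num.diff().tolist())    # change from the previous observation (first is NaN)
print(s_num.cumsum().tolist())  # running total
print(s_num.nlargest(2))        # the two largest values with their labels
print(s_num.idxmax())           # label of the largest value
```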

# Boolean Series

## Logical Expression for Series (I/II)
*Can we test an expression for all elements?*

Yes: **==**, **!=** work for a single object or Series with same indices. Example:

In [97]:
print(my_series3)
print()
print(my_series3 == 0)

yesterday    0
today        1
tomorrow     3
dtype: int64

yesterday     True
today        False
tomorrow     False
dtype: bool


What datatype is returned? 


## Logical Expression in Series  (II/II)
*Can we check if elements in a series equal some element in a container?*

Yes, the `isin` method. Example:

In [96]:
my_rng = list(range(2))

print(my_rng)
print()
print(my_series3.isin(my_rng)) 

[0, 1]

yesterday     True
today         True
tomorrow     False
dtype: bool


## Power of Boolean Series (I/II)
*Can we combine boolean Series?*

Yes, we can use:
- the `&` operator (*and*)
- the `|` operator (*or*)

In [93]:
import seaborn as sns
titanic = sns.load_dataset('titanic')
print(titanic.loc[range(3),['survived', 'age', 'sex']])

   survived   age     sex
0         0  22.0    male
1         1  38.0  female
2         1  26.0  female


In [94]:
print(((titanic.sex == 'female') & (titanic.age >= 30)).head(3)) # selection by multiple columns

0    False
1     True
2    False
dtype: bool


What datatype is returned? 


## Power of Boolean Series (II/II)
*Why do we care for boolean series (and arrays)?*

Mainly because we can use them to select rows based on their content.

In [95]:
print(my_series3[my_series3<3])
print()
print(my_series3)

yesterday    0
today        1
dtype: int64

yesterday    0
today        1
tomorrow     3
dtype: int64


NOTE: Boolean selection is extremely useful for dataframes!!
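
As a sketch of boolean selection on a DataFrame, the block below uses a tiny made-up table (loosely modelled on the `sex` and `age` columns used earlier) so that it runs without downloading any data. The `~` operator for negation is an addition not shown above.

```python
import pandas as pd

# Hypothetical passenger data
people = pd.DataFrame({'sex': ['male', 'female', 'female', 'male'],
                       'age': [22, 38, 26, 35]})

is_female = people['sex'] == 'female'
is_30plus = people['age'] >= 30

print(people[is_female & is_30plus])   # both conditions hold
print(people[is_female | is_30plus])   # at least one condition holds
print(people[~is_female])              # negation with ~
```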

# Inspecting and Selecting Observations

## Viewing Series and Dataframes
*How can we view the contents in our dataset?*
- We can use `print` on our dataset
- We can visualize patterns by plotting

## The Head and Tail

We select the *first* rows in a DataFrame or Series with the `head` method.

In [108]:
#arr = np.random.normal(size=[100])
#my_series7 = pd.Series(arr)
titanic.head(3)

   survived  pclass     sex   age  sibsp  parch     fare embarked  class    who  adult_male deck  embark_town alive  alone
0         0       3    male  22.0      1      0   7.2500        S  Third    man        True  NaN  Southampton    no  False
1         1       1  female  38.0      1      0  71.2833        C  First  woman       False    C    Cherbourg   yes  False
2         1       3  female  26.0      0      0   7.9250        S  Third  woman       False  NaN  Southampton   yes   True


The `tail` method selects the last observations in a DataFrame. 

## Row Selection (I/III)
*How can we select certain rows in a Series for given index **keys**?* 

With the `loc` attribute. Example:

In [111]:
print(my_series3)

my_loc = ['today', 'tomorrow']
print()
print(my_series3.loc[my_loc])

yesterday    0
today        1
tomorrow     3
dtype: int64

today       1
tomorrow    3
dtype: int64


## Row selection (II/III)
*How can we select certain rows in a Series for given index **integers**?* 

The `iloc` attribute selects rows for provided index integers (positions). 

In [112]:
print(titanic.iloc[10:15,:5])

    survived  pclass     sex   age  sibsp
10         1       3  female   4.0      1
11         1       1  female  58.0      0
12         0       3    male  20.0      0
13         0       3    male  39.0      1
14         0       3  female  14.0      0


Clearly, this is very similar to working with matrices in numpy! 

## Row selection (III/III)
*Do our tools for viewing specific rows, i.e. `loc`, `iloc` work for DataFrames?* 

- Yes, we can use both `loc` and `iloc`. By default they select rows, just as for Series.

In [190]:
my_idx = ['i', 'ii', 'iii']
my_cols = ['a','b']

my_data = np.arange(1,7) #my_data = [[1, 2], [3, 4], [5, 6]]
my_data = my_data.reshape(3,2)

my_df = pd.DataFrame(my_data, columns=my_cols, index=my_idx)

print(my_df)
print()
print(my_df.loc[['i','ii']])
print()
print(my_df.iloc[:2])

     a  b
i    1  2
ii   3  4
iii  5  6

    a  b
i   1  2
ii  3  4

    a  b
i   1  2
ii  3  4
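
One caveat worth a small sketch: when slicing, `loc` includes the end label while `iloc` excludes the end position. The series below is made up and uses integer labels to make the difference visible.

```python
import pandas as pd

s_pos = pd.Series(['a', 'b', 'c', 'd'], index=[10, 20, 30, 40])

by_label = s_pos.loc[10:30]     # label slice: the end label 30 is included
by_position = s_pos.iloc[0:2]   # position slice: the end position 2 is excluded

print(by_label)
print(by_position)
```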


## Columns Selection (I/II)
*How are `loc`, `iloc` different for DataFrames?* 

- For DataFrames, we can also specify columns.

In [191]:
idx_keep = ['i','ii']
cols_keep = ['a']
print(my_df.loc[idx_keep, cols_keep])

    a
i   1
ii  3


## Columns Selection (II/II)
*How can we generally select columns in a DataFrame?* 

- Option 1: using the `[]` and providing a list of columns.
- Option 2: using `loc` and setting row selection as `:`.

In [192]:
print(my_df.loc[:,['b']])

     b
i    2
ii   4
iii  6
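
Option 1 can be sketched as follows, on a frame rebuilt here with the same contents as `my_df`. Note the difference between passing a list of columns and a single key.

```python
import pandas as pd

frame = pd.DataFrame({'a': [1, 3, 5], 'b': [2, 4, 6]}, index=['i', 'ii', 'iii'])

as_frame = frame[['b']]    # list of columns -> DataFrame
as_series = frame['b']     # single key -> Series

print(type(as_frame).__name__, type(as_series).__name__)
print(as_frame.equals(frame.loc[:, ['b']]))   # same result as the loc version
```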


## Selection quiz
*What does `:` do in `iloc` or `loc`?* 

Select all rows (columns).

# Modifying DataFrames

## Modifying DataFrames
*Why do we want to modify DataFrames?*

- Because data rarely comes in the form we want it.


## Changing the Index (I/III)
*How can we change the index of a DataFrame?*

We change or set a DataFrame's index using its method `set_index`. Example:

In [193]:
print(my_df.set_index('a'))
print()
print(my_df)

   b
a   
1  2
3  4
5  6

     a  b
i    1  2
ii   3  4
iii  5  6


Clearly, doing so, we also implicitly delete the previous index.

Also, notice that the column name *b* is now printed one line above the index name *a*.

## Changing the Index (II/III)
*Is our DataFrame changed? I.e. does it have a new index?*

No, we must overwrite it or make it into a new object:

In [194]:
print(my_df)
my_df_a = my_df.set_index('a')
print()
print(my_df_a)
print()
print(my_df_a.iloc[1,0])

     a  b
i    1  2
ii   3  4
iii  5  6

   b
a   
1  2
3  4
5  6

4


## Changing the index (III/III)

Sometimes we wish to remove the index. This is done with the `reset_index` method:

In [195]:
print(my_df_a.reset_index()) # default drop=False: the index becomes a column
print()
print(my_df_a.reset_index(drop=True)) # drop=True
print()
print(my_df)

   a  b
0  1  2
1  3  4
2  5  6

   b
0  2
1  4
2  6

     a  b
i    1  2
ii   3  4
iii  5  6


The old index (`i`, `ii`, `iii`) cannot be restored (that information was lost), but the interim index `a` is by default made into a new variable.

By specifying the keyword argument `drop=True` we delete this index instead.

*To note:* Indices can have multiple levels, in this case `level` can be specified to delete a specific level.
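
A minimal sketch of the multi-level case, assuming a hypothetical two-level index of year and quarter with made-up values:

```python
import pandas as pd

# Hypothetical two-level index: (year, quarter)
idx = pd.MultiIndex.from_tuples([(2008, 'Q1'), (2008, 'Q2'), (2009, 'Q1')],
                                names=['year', 'quarter'])
pop = pd.DataFrame({'value': [1, 2, 3]}, index=idx)

only_quarter = pop.reset_index(level='year')              # move 'year' back to a column
dropped_year = pop.reset_index(level='year', drop=True)   # discard 'year' entirely

print(only_quarter)
print(dropped_year)
```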

## Changing the Column Names

Column names can be changed by assigning to the `columns` attribute:

In [196]:
print(my_df)
my_df.columns = ['A', 'B']
print()
print(my_df)

     a  b
i    1  2
ii   3  4
iii  5  6

     A  B
i    1  2
ii   3  4
iii  5  6


DataFrames also have a method called `rename`.

In [197]:
print(my_df)
my_df.columns = ['A', 'B']
print()
print(my_df)
my_df = my_df.rename(columns={'A': 'Aa'})
print()
print(my_df)

     A  B
i    1  2
ii   3  4
iii  5  6

     A  B
i    1  2
ii   3  4
iii  5  6

     Aa  B
i     1  2
ii    3  4
iii   5  6


## Changing all Column Values
*How can we update all values in a column of a DataFrame?*

In [198]:
print(my_df)

# set uniform value
my_df['B'] = 3
print()
print(my_df)

# set different values
my_df['B'] = [2,17,0] 
print()
print(my_df)

     Aa  B
i     1  2
ii    3  4
iii   5  6

     Aa  B
i     1  3
ii    3  3
iii   5  3

     Aa   B
i     1   2
ii    3  17
iii   5   0


## Changing Specific Column Values
*How can we update specific values in a DataFrame?*

In [202]:
print(my_df)

# loc, iloc
my_loc2 = ['i', 'iii']
my_df.loc[my_loc2, 'Aa'] = 10

print()
print(my_df)

     Aa   B
i     1   2
ii    3  17
iii   5   0

     Aa   B
i    10   2
ii    3  17
iii  10   0
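
The same pattern also works with a boolean condition or with `iloc`. A sketch, on a frame rebuilt here with the values of `my_df` before the update (the threshold and the replacement values are arbitrary):

```python
import pandas as pd

frame = pd.DataFrame({'Aa': [1, 3, 5], 'B': [2, 17, 0]}, index=['i', 'ii', 'iii'])

frame.loc[frame['B'] > 1, 'Aa'] = 10   # update 'Aa' only where B exceeds 1
print(frame)

frame.iloc[2, 1] = 99                  # position-based: third row, second column
print(frame)
```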


## Sorting Data

A DataFrame can be sorted with `sort_values`; this method takes one or more columns to sort by. 

In [203]:
print(my_df.sort_values(by='Aa', ascending=True))

     Aa   B
ii    3  17
i    10   2
iii  10   0


*To note:* `sort_values` accepts many keyword arguments, including `ascending`, which can be set to `False` for one or more variables if we want descending order. 
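
A short sketch of sorting by several columns with a different direction for each, on a frame rebuilt here with the same values as `my_df`:

```python
import pandas as pd

frame = pd.DataFrame({'Aa': [10, 3, 10], 'B': [2, 17, 0]}, index=['i', 'ii', 'iii'])

# Sort by 'Aa' descending, and break ties by 'B' ascending
ordered = frame.sort_values(by=['Aa', 'B'], ascending=[False, True])
print(ordered)
```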

Sorting by index is possible with `sort_index`.

In [204]:
print(my_df.sort_index())

     Aa   B
i    10   2
ii    3  17
iii  10   0


# DataFrame IO: Loading and Storing

## Reading DataFrames (I/II)

Download the file from [URL](https://api.statbank.dk/v1/data/FOLK1A/CSV?lang=en&Tid=*). Open directly in Pandas.

In [216]:
url = 'https://api.statbank.dk/v1/data/FOLK1A/CSV?lang=en&Tid=*'
df = pd.read_csv(url, sep=';') # open the file as dataframe
print(df.head(5))

      TID  INDHOLD
0  2008Q1  5475791
1  2008Q2  5482266
2  2008Q3  5489022
3  2008Q4  5505995
4  2009Q1  5511451


Tomorrow we'll learn how to parse the time column!

## Reading DataFrames (II/II)

Now let's try opening the file from the [URL](https://api.statbank.dk/v1/data/FOLK1A/CSV?lang=en&Tid=*) as a local file:

In [220]:
abs_path = 'C:/Users/xtw562/Downloads/FOLK1A.csv' # absolute path 
rel_path = 'FOLK1A.csv' # relative path 

df = pd.read_csv(rel_path, sep=';') # open the file as dataframe
print(df.head(2))

      TID  INDHOLD
0  2008Q1  5475791
1  2008Q2  5482266


- absolute path: the entire path, starting from the drive or root directory
- relative path: the path relative to where your program, i.e. Jupyter, is running
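
To see the difference in practice, the standard library's `pathlib` can show the working directory and build the absolute counterpart of a relative path. This is only a sketch; the file does not need to exist for it to run.

```python
from pathlib import Path

rel_path = Path('FOLK1A.csv')      # relative: interpreted from the working directory
abs_path = Path.cwd() / rel_path   # absolute: working directory + relative path

print(Path.cwd())                  # where Python (e.g. Jupyter) is running
print(rel_path.is_absolute(), abs_path.is_absolute())
```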

## Reading Other Data Types

Other pandas readers include `read_excel`, `read_sql`, `read_sas`, `read_stata` and many more.

## Storing Data

Data can be stored in a particular format with to_(FORMAT) where (FORMAT) is the file type such as csv. Let's try with to_csv:



In [221]:
df.to_csv('DST_people_count.csv', index=False)

Should we always set `index=False`? 

Usually, but sometimes the index contains information, e.g. in time series or after a groupby operation. 
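
A sketch of the case where the index is worth keeping: a small frame with the time column as index is written and read back. An in-memory buffer stands in for a file on disk, and the two rows reuse the values printed earlier.

```python
import io
import pandas as pd

ts = pd.DataFrame({'INDHOLD': [5475791, 5482266]},
                  index=pd.Index(['2008Q1', '2008Q2'], name='TID'))

buffer = io.StringIO()          # in-memory stand-in for a file on disk
ts.to_csv(buffer)               # index=True (the default) keeps the 'TID' labels
buffer.seek(0)

restored = pd.read_csv(buffer, index_col='TID')   # tell pandas which column is the index
print(restored)
```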