---
<center><h1>Basic intro into pandas</h1></center>


<center><h2> Introduction to pandas data structures</h2></center>

---

[pandas](http://pandas.pydata.org/pandas-docs/stable/) is a powerful and flexible open-source Python library for data analysis. Python has long been great for data preparation, but less so for data analysis and modeling. pandas helps fill this gap, letting us carry out an entire data analysis workflow in Python without switching to a more domain-specific language like R, or loading working data into a database and using SQL (or worse, Excel). pandas makes Python great for analysis.

Library Highlights:

- A fast and efficient DataFrame object for data manipulation with integrated indexing (like a spreadsheet, but better in many ways);
- Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the HDF5 format;
- Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
- Flexible reshaping and pivoting of data sets;
- Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
- Columns can be inserted and deleted from data structures for size mutability;
- Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;
- High performance merging and joining of data sets;
- Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
- Time-series functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. You can even create domain-specific time offsets and join time series without losing data;
- Highly optimized for performance, with critical code paths written in Cython or C.
- Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, Geospatial Sciences, and more.


# Table of Contents
- [Introduction to pandas data structures](#Introduction-to-pandas-data-structures)
    * [Series](#Series)
    * [DataFrames](#DataFrames)
    * [Read data from files and write data to files](#Read-data-from-files-and-write-data-to-files)
    - [*Exercise 1*](#Exercise-1)
    - [*Exercise 2*](#Exercise-2)

To use pandas, simply import it as a Python module, usually together with NumPy (which we learned about last week).

In [1]:
import pandas as pd
import numpy as np
import random

## Introduction to pandas data structures

[[back to top]](#Table-of-Contents)

### Series

[[back to top]](#Table-of-Contents)

Pandas introduces two new data structures to Python – Series and DataFrame, both of which are built on top of NumPy.
A Series is a one-dimensional object similar to an array, list, or column in a table. It will assign a labeled index to each item in the Series. By default, each item will receive an index label from 0 to N, where N is the length of the Series minus one.

In [2]:
my_first_series = pd.Series([1, 'hello, world', np.nan, -1234567890, 3.14, 0])
my_first_series

0               1
1    hello, world
2             NaN
3     -1234567890
4            3.14
5               0
dtype: object

You can also set a specific index when creating the Series. Doing this makes it more like a Python dictionary (associative array).

In [3]:
my_first_series = pd.Series([1, 'hello, world', np.nan, -1234567890, 3.14, 0], index=['A', 'B', 'unknown', 0, 'C', 'D'])
my_first_series

A                     1
B          hello, world
unknown             NaN
0           -1234567890
C                  3.14
D                     0
dtype: object

In fact, the Series constructor can convert a Python dictionary, using the keys of the dictionary as its index.

In [4]:
my_dict = {'John': 10, 'Annet': 12, 'Robert': 5, 'Jack': 55}
my_first_series = pd.Series(my_dict)
my_first_series

John      10
Annet     12
Robert     5
Jack      55
dtype: int64

You can then use the index to select items from the Series in a way nearly identical to using a Python dictionary.

In [5]:
my_dict['Jack']

55

In [6]:
my_first_series['Jack']

55

However, you can also ask for more than one item of the Series by passing a list of index labels.

In [7]:
my_first_series[['Jack', 'Robert']]

Jack      55
Robert     5
dtype: int64

To see all index labels of the Series, use the `index` attribute:

In [8]:
my_first_series.index

Index(['John', 'Annet', 'Robert', 'Jack'], dtype='object')

Similarly, you can display only the values:

In [9]:
my_first_series.values

array([10, 12,  5, 55], dtype=int64)

pandas provides the method [`rename({old_name_1: new_name_1, old_name_2: new_name_2, ... })`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html), which returns a new object with the given index labels changed and leaves the original untouched.

We can also filter a Series. Let’s find all items that are odd and less than 20:
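
As a minimal sketch of how `rename` behaves on a Series, consider the snippet below. The `ages` Series and its labels are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical data, used only for this illustration.
ages = pd.Series({'Tom': 31, 'Ann': 27, 'Bob': 45})

# rename() takes an {old_label: new_label} mapping and returns a new Series.
renamed = ages.rename({'Tom': 'Thomas', 'Bob': 'Robert'})

print(list(renamed.index))  # ['Thomas', 'Ann', 'Robert']
print(list(ages.index))     # ['Tom', 'Ann', 'Bob'] (the original is unchanged)
```

Because the result is a new object, it has to be assigned to a variable if you want to keep it.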

In [10]:
my_first_series[(my_first_series < 20) & (my_first_series % 2 != 0)]

Robert    5
dtype: int64

You can achieve the same result with the plain Python dictionary `my_dict` that we used before, but this involves writing a loop, for example:

In [11]:
filtered = {}
for key, val in my_dict.items():
    if val < 20 and val % 2 != 0:
        filtered[key] = val
filtered

{'Robert': 5}

There is a shorter way to write the loop using a dictionary comprehension, which we did not cover earlier:

In [12]:
filtered = {key: val for key, val in my_dict.items() if val < 20 and val % 2 != 0}
filtered

{'Robert': 5}

The previous example demonstrates only one of the many advantages of pandas over pure Python.
A Series is a mutable data structure, so you can easily change the value of any item:

In [13]:
print("Robert's previous value : {}".format(my_first_series['Robert']))
my_first_series['Robert'] = 15
print("Robert's new value : {}".format(my_first_series['Robert']))

Robert's previous value : 5
Robert's new value : 15


or add new values:

In [14]:
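# Series.append was removed in pandas 2.0; pd.concat is its replacement.
# The line stays commented out so that the outputs below remain unchanged.
# my_first_series = pd.concat([my_first_series, pd.Series({'Joshua': 0, 'George': 200})])
my_first_series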

John      10
Annet     12
Robert    15
Jack      55
dtype: int64
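
Since the cell above leaves the addition commented out, here is a small sketch of two ways to add items. It uses a hypothetical Series named `points` instead of `my_first_series`, so the outputs in the rest of this notebook are not affected.

```python
import pandas as pd

# Hypothetical Series, separate from my_first_series.
points = pd.Series({'John': 10, 'Annet': 12})

# Assigning to a label that does not exist yet adds a new item in place.
points['Joshua'] = 0

# pd.concat joins several Series and returns a new one.
more_points = pd.Series({'George': 200, 'Maria': 7})
combined = pd.concat([points, more_points])

print(list(combined.index))  # ['John', 'Annet', 'Joshua', 'George', 'Maria']
```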

Applying a mathematical operation to the items of a Series works much as it does for an array:

In [15]:
my_first_series_2 = my_first_series / 1.5
my_first_series_2

John       6.666667
Annet      8.000000
Robert    10.000000
Jack      36.666667
dtype: float64

Thus the operation is applied to each item of the Series, as a loop would do for a Python list.
In the same way you may add, subtract, etc. two or more Series:

In [16]:
my_first_series_total = my_first_series + my_first_series_2
my_first_series_total

John      16.666667
Annet     20.000000
Robert    25.000000
Jack      91.666667
dtype: float64

`NULL`/`NaN` checking can be performed with `isnull()` and `notnull()`.

In [17]:
my_first_series_total.notnull()
my_first_series[my_first_series_total.notnull()]

John      10
Annet     12
Robert    15
Jack      55
dtype: int64

In [18]:
my_first_series_total.isnull()

John      False
Annet     False
Robert    False
Jack      False
dtype: bool
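
The Series above contains no missing values, so every check returns the same answer. The sketch below uses a hypothetical Series named `readings` with one `NaN` to show what the two methods return when data is actually missing.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value.
readings = pd.Series({'a': 1.5, 'b': np.nan, 'c': 3.0})

missing = readings.isnull()                # boolean mask, True where the value is NaN
present = readings[readings.notnull()]     # keep only the items that have a value

print(missing.tolist())       # [False, True, False]
print(list(present.index))    # ['a', 'c']
print(int(missing.sum()))     # 1 (number of missing items)
```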

### DataFrames
[[back to top]](#Table-of-Contents)

A DataFrame is a tabular data structure made up of rows and columns, similar to a spreadsheet or a database table. Together with Series, it is one of the primary data structures in pandas. We can think of a DataFrame as a group of Series objects that share an index, with one Series per column name (similar to a dictionary of dictionaries). Arithmetic operations align on both row and column labels.
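
To make the "group of Series sharing an index" picture concrete, here is a minimal sketch. All names and numbers (`goals`, `cards`, `table`, `other`) are hypothetical and serve only as an illustration.

```python
import pandas as pd

# Two hypothetical Series with the same index labels.
goals = pd.Series({'A': 3, 'B': 1, 'C': 2})
cards = pd.Series({'A': 0, 'B': 4, 'C': 1})

# A dict of Series becomes a DataFrame: keys are column names, the index is shared.
table = pd.DataFrame({'goals': goals, 'cards': cards})

# Selecting one column gives back a Series with the DataFrame's index.
column = table['goals']
print(type(column).__name__)             # Series
print(column.index.equals(table.index))  # True

# Arithmetic aligns on row and column labels; labels present on only one
# side produce NaN in the result.
other = pd.DataFrame({'goals': pd.Series({'A': 10, 'B': 20})})
total = table + other
print(total.loc['A', 'goals'])           # 13.0
```

In `total`, row `C` and the whole `cards` column are `NaN`, because `other` has no matching labels for them.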

One of the simplest ways to create a DataFrame from common Python data structures is to pass a dictionary of lists to the DataFrame constructor. To control the column order we can use the `columns` parameter: older pandas versions sorted the columns alphabetically by default, while recent versions keep the dictionary's insertion order.

Let’s create a DataFrame listing the World Cup finals since 1990, with the year, the winner, the runner-up and the final score:

In [19]:
data = {'year': [1990, 1994, 1998, 2002, 2006, 2010, 2014],
        'winner': ['Germany', 'Brazil', 'France', 'Brazil','Italy', 'Spain', 'Germany'],
        'runner-up': ['Argentina', 'Italy', 'Brazil','Germany', 'France', 'Netherlands', 'Argentina'],
        'final score': ['1-0', '0-0 (pen)', '3-0', '2-0', '1-1 (pen)', '1-0', '1-0'] }
world_cup = pd.DataFrame(data, columns=data.keys())
world_cup

   year   winner    runner-up  final score
0  1990  Germany    Argentina          1-0
1  1994   Brazil        Italy    0-0 (pen)
2  1998   France       Brazil          3-0
3  2002   Brazil      Germany          2-0
4  2006    Italy       France    1-1 (pen)
5  2010    Spain  Netherlands          1-0
6  2014  Germany    Argentina          1-0


Another way to create a DataFrame is to pass a Python list of dictionaries:

In [20]:
data_2 = [{'year': 1990, 'winner': 'Germany', 'runner-up': 'Argentina', 'final score': '1-0'}, 
          {'year': 1994, 'winner': 'Brazil', 'runner-up': 'Italy', 'final score': '0-0 (pen)'},
          {'year': 1998, 'winner': 'France', 'runner-up': 'Brazil', 'final score': '3-0'}, 
          {'year': 2002, 'winner': 'Brazil', 'runner-up': 'Germany', 'final score': '2-0'}, 
          {'year': 2006, 'winner': 'Italy','runner-up': 'France', 'final score': '1-1 (pen)'}, 
          {'year': 2010, 'winner': 'Spain', 'runner-up': 'Netherlands', 'final score': '1-0'}, 
          {'year': 2014, 'winner': 'Germany', 'runner-up': 'Argentina', 'final score': '1-0'}
         ]
world_cup = pd.DataFrame(data_2)
world_cup

   year   winner    runner-up  final score
0  1990  Germany    Argentina          1-0
1  1994   Brazil        Italy    0-0 (pen)
2  1998   France       Brazil          3-0
3  2002   Brazil      Germany          2-0
4  2006    Italy       France    1-1 (pen)
5  2010    Spain  Netherlands          1-0
6  2014  Germany    Argentina          1-0


If you want to see only the first 3 rows of the previous table, use the `head(n)` method, where `n` is the number of rows to return from the top of the table.

Note that `head()` is equivalent to `head(5)`.

In [21]:
world_cup.head(3)

   year   winner  runner-up  final score
0  1990  Germany  Argentina          1-0
1  1994   Brazil      Italy    0-0 (pen)
2  1998   France     Brazil          3-0


The `tail(n)` method works like `head(n)`, but returns the last `n` rows of the DataFrame:

In [22]:
world_cup.tail(2)

   year   winner    runner-up  final score
5  2010    Spain  Netherlands          1-0
6  2014  Germany    Argentina          1-0


You can also use the well-known Python slice notation here:

In [23]:
world_cup[2:5]

   year  winner  runner-up  final score
2  1998  France     Brazil          3-0
3  2002  Brazil    Germany          2-0
4  2006   Italy     France    1-1 (pen)


### Read data from files and write data to files
[[back to top]](#Table-of-Contents)

We often need to work with a dataset saved in a text file of a specific format (txt, CSV, JSON etc.) or in a database (MySQL, in particular). pandas allows us to load data from such a file or database into a DataFrame (databases will be covered in a following part of this series about pandas). Let’s see how to read and write a dataset for different types of files:

1\. CSV files (“Comma-Separated Values” is a text format for tabular data: each line of the file corresponds to one row of the table, and the values within a row are separated by a delimiter, usually a comma):

**Reading:**

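    df = pd.read_csv("path/to/the/csv/file/for/reading")

**Writing:**

    df.to_csv("path/to/the/folder/where/you/want/save/csv/file")

where the path points to the CSV file and may be absolute, like `"C:/User/csv_file_with_data.csv"`, or relative to the current working directory. Forward slashes are used here because a backslash inside an ordinary Python string can start an escape sequence (for example, `\t` is a tab).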

2\. Excel file(s) (\*.xls and \*.xlsx):  

**Reading:**

    df = pd.read_excel("path/to/the/excel/file/for/reading", sheet_name="Sheet1")

**Writing:**

    df.to_excel("path/to/the/folder/where/you/want/save/excel/file")

where you should set the path to the Excel file and the sheet name, such as “Sheet1”.

3\. Text files (a txt file can be read like a CSV file with a different separator (delimiter); below we assume that the columns are separated by tabs):

**Reading:**

    df = pd.read_csv("path/to/the/txt/file/for/reading", sep='\t')

**Writing:**

    df.to_csv("path/to/the/folder/where/you/want/save/txt/file", sep='\t')
    
4\. JSON files (an open-standard format that uses human-readable text to transmit data objects consisting of attribute–value pairs. It is widely used for asynchronous browser/server communication, and it looks very similar to a Python dictionary):

**Reading:**

    df = pd.read_json("path/to/the/json/file/for/reading")

**Writing:**

    df.to_json("path/to/the/folder/where/you/want/save/json/file")
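
The text-based formats above can be tried without touching the disk. The following is a minimal sketch, assuming a small hypothetical DataFrame `df`: when no path is given, `to_csv` and `to_json` return the serialized text as a string, and `io.StringIO` lets the readers consume that string as if it were a file. The choice of `orient='records'` is just one of the available JSON layouts.

```python
import io
import pandas as pd

# Hypothetical data, used only for this illustration.
df = pd.DataFrame({'year': [2010, 2014], 'winner': ['Spain', 'Germany']})

# Tab-separated text: to_csv returns a string when no path is given.
tsv_text = df.to_csv(sep='\t', index=False)
df_from_tsv = pd.read_csv(io.StringIO(tsv_text), sep='\t')

# JSON: to_json also returns a string when no path is given.
json_text = df.to_json(orient='records')
df_from_json = pd.read_json(io.StringIO(json_text), orient='records')

print(df_from_tsv['winner'].tolist())   # ['Spain', 'Germany']
print(df_from_json['year'].tolist())    # [2010, 2014]
```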

Note that path separators differ between operating systems (`/` on Linux and macOS, `\` on Windows). A good practice is to build paths with Python's `os` library. Suppose we want to save a DataFrame in the folder `target_folder` within the following hierarchy:

    main_folder
    |----folder1
    |----folder2
         |----sub_folder1
         |----target_folder
         |----sub_folder3
    |----folder3
         |----sub_folder1
         
and call the file `"my_file.csv"`. The code for saving it using the `os` library is the following:

    import os
    df.to_csv(os.path.join("main_folder", "folder2", "target_folder", "my_file.csv"))

Thus, `os.path.join` joins all the directories into one path, using the separator appropriate for the current OS.
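
A self-contained variant of the same idea is sketched below. It assumes a hypothetical DataFrame `df` and builds the folder hierarchy inside a temporary directory (via `tempfile`), so that it can run anywhere without leaving files behind. `os.makedirs` is needed because `to_csv` does not create missing folders.

```python
import os
import tempfile
import pandas as pd

# Hypothetical data, used only for this illustration.
df = pd.DataFrame({'year': [2010, 2014], 'winner': ['Spain', 'Germany']})

with tempfile.TemporaryDirectory() as root:
    # Build the folder path in an OS-independent way.
    target_dir = os.path.join(root, "main_folder", "folder2", "target_folder")
    os.makedirs(target_dir, exist_ok=True)   # create the folders if they are missing

    file_path = os.path.join(target_dir, "my_file.csv")
    df.to_csv(file_path, index=False)
    file_was_written = os.path.exists(file_path)

print(file_was_written)  # True
```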

Let's save the `world_cup` DataFrame as CSV and JSON files:

In [24]:
world_cup.to_csv("world_cup.csv") #, index=False, header=False)
print("DataFrame was written")

# Check whether the "world_cup.csv" exists
import os
print(os.path.exists("world_cup.csv"))

# Read the file back and print its contents
with open("world_cup.csv") as f:
    print(f.read())

DataFrame was written
True
,year,winner,runner-up,final score
0,1990,Germany,Argentina,1-0
1,1994,Brazil,Italy,0-0 (pen)
2,1998,France,Brazil,3-0
3,2002,Brazil,Germany,2-0
4,2006,Italy,France,1-1 (pen)
5,2010,Spain,Netherlands,1-0
6,2014,Germany,Argentina,1-0



To save the CSV file without the index column, pass the `index=False` argument to `to_csv`.
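
The effect of `index=False` is easiest to see on the header line. The sketch below assumes a small hypothetical DataFrame `df` and writes to a string instead of a file (`to_csv` returns the text when no path is given).

```python
import io
import pandas as pd

# Hypothetical data, used only for this illustration.
df = pd.DataFrame({'year': [2010, 2014], 'winner': ['Spain', 'Germany']})

with_index = df.to_csv()                 # default: the index is written as the first column
without_index = df.to_csv(index=False)   # the index is left out

print(with_index.splitlines()[0])     # ,year,winner
print(without_index.splitlines()[0])  # year,winner

# Reading the index-free text back gives exactly the original columns.
reloaded = pd.read_csv(io.StringIO(without_index))
print(list(reloaded.columns))         # ['year', 'winner']
```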

In [25]:
world_cup.to_json("world_cup.json")
print("DataFrame was written")

# Check whether the "world_cup.json" exists
import os
print(os.path.exists("world_cup.json"))

# Read the file back and print its contents
with open("world_cup.json") as f:
    print(f.read())

# Load the JSON into a Python dict for a more readable display
import json
with open("world_cup.json") as f:
    content = json.load(f)
content

DataFrame was written
True
{"year":{"0":1990,"1":1994,"2":1998,"3":2002,"4":2006,"5":2010,"6":2014},"winner":{"0":"Germany","1":"Brazil","2":"France","3":"Brazil","4":"Italy","5":"Spain","6":"Germany"},"runner-up":{"0":"Argentina","1":"Italy","2":"Brazil","3":"Germany","4":"France","5":"Netherlands","6":"Argentina"},"final score":{"0":"1-0","1":"0-0 (pen)","2":"3-0","3":"2-0","4":"1-1 (pen)","5":"1-0","6":"1-0"}}


{'year': {'0': 1990, '1': 1994, '2': 1998, '3': 2002, '4': 2006, '5': 2010, '6': 2014}, 'winner': {'0': 'Germany', '1': 'Brazil', '2': 'France', '3': 'Brazil', '4': 'Italy', '5': 'Spain', '6': 'Germany'}, 'runner-up': {'0': 'Argentina', '1': 'Italy', '2': 'Brazil', '3': 'Germany', '4': 'France', '5': 'Netherlands', '6': 'Argentina'}, 'final score': {'0': '1-0', '1': '0-0 (pen)', '2': '3-0', '3': '2-0', '4': '1-1 (pen)', '5': '1-0', '6': '1-0'}}

Now let's read the CSV and JSON files we just saved into new DataFrames:

In [26]:
df_csv = pd.read_csv("world_cup.csv")
df_csv

   Unnamed: 0  year   winner    runner-up  final score
0           0  1990  Germany    Argentina          1-0
1           1  1994   Brazil        Italy    0-0 (pen)
2           2  1998   France       Brazil          3-0
3           3  2002   Brazil      Germany          2-0
4           4  2006    Italy       France    1-1 (pen)
5           5  2010    Spain  Netherlands          1-0
6           6  2014  Germany    Argentina          1-0


In [27]:
df_csv.set_index("year")

      Unnamed: 0   winner    runner-up  final score
year
1990           0  Germany    Argentina          1-0
1994           1   Brazil        Italy    0-0 (pen)
1998           2   France       Brazil          3-0
2002           3   Brazil      Germany          2-0
2006           4    Italy       France    1-1 (pen)
2010           5    Spain  Netherlands          1-0
2014           6  Germany    Argentina          1-0


As you can see, `df_csv` contains an additional column `Unnamed: 0`, which holds the index that was written to the file. You can avoid it by passing the `index_col=0` argument to `read_csv`:

In [28]:
df_csv = pd.read_csv("world_cup.csv", index_col=0)
df_csv

   year   winner    runner-up  final score
0  1990  Germany    Argentina          1-0
1  1994   Brazil        Italy    0-0 (pen)
2  1998   France       Brazil          3-0
3  2002   Brazil      Germany          2-0
4  2006    Italy       France    1-1 (pen)
5  2010    Spain  Netherlands          1-0
6  2014  Germany    Argentina          1-0


In [29]:
df_json = pd.read_json("world_cup.json")
df_json

   year   winner    runner-up  final score
0  1990  Germany    Argentina          1-0
1  1994   Brazil        Italy    0-0 (pen)
2  1998   France       Brazil          3-0
3  2002   Brazil      Germany          2-0
4  2006    Italy       France    1-1 (pen)
5  2010    Spain  Netherlands          1-0
6  2014  Germany    Argentina          1-0


> ### Exercise 1

> - Rename `John` to `Barbara` in `my_first_series` and change value of `Jack` from `55` to `-10`.

In [30]:
# type your code here
my_first_series

John      10
Annet     12
Robert    15
Jack      55
dtype: int64

> ### Exercise 2

> - Find all positive values in `my_first_series` using filtering and store the resulting Series in the variable `positive`.

> - Create a new Series `new_series` that contains all items from `my_first_series` and two new items, `(Ashly, NaN)` and `(Lukas, -5)`.

> - First add `new_series` and `my_first_series`, then multiply them. Try to explain the results you get.

> - Save `new_series` as a CSV file in the folder that contains the current IPython notebook. Name the file `"new_series.csv"`.

In [31]:
# type your code here
positive = my_first_series[my_first_series > 0]
positive

John      10
Annet     12
Robert    15
Jack      55
dtype: int64