<!--Course_INFORMATION-->

<img align="left" style="padding-right:10px;" src="../../figures/Atos_Logo_RGB.jpg" width="100" height="50">
*This course contains the introduction to Python and Pandas, and is part of the traineeship Data Science from Atos Data Science University*

<!--NAVIGATION-->
< [Previous](0.Python_for_Data_Analysis.ipynb) | [Contents](Index.ipynb) | [Next](2.Groupby.ipynb) >

# Python for Data Science


### Loading Python libraries

Press Shift+Enter to execute a Jupyter cell.



In [None]:
#Import Python Libraries
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

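Pandas is a Python package built mainly around two data structures:
- **Series** (1D labeled, homogeneous array)
- **DataFrame** (2D labeled, heterogeneous table)

Older pandas versions also provided **Panel** (a general 3D array); it was deprecated and later removed, so current versions no longer include it.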

### Pandas Series

The first main data type we will learn about in pandas is the Series. Let's import pandas and explore the Series object.

A Series is very similar to a NumPy array (in fact, it is built on top of the NumPy array object). What differentiates a Series from a NumPy array is that a Series can have axis labels, meaning it can be indexed by a label instead of just a numeric position. It also does not need to hold numeric data; it can hold any arbitrary Python object.

Let's explore this concept through some examples:

In [None]:
import numpy as np
import pandas as pd

A pandas *Series* is a one-dimensional labeled array holding data of a single dtype (integers, strings, floating-point numbers, Python objects, etc.). The axis labels are collectively referred to as the *index*.
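
As a minimal sketch of the difference between a labeled Series and a plain NumPy array, the example below looks up the same element by label, by position, and from the underlying array. The names `values` and `labeled` are made up for illustration.

```python
import numpy as np
import pandas as pd

values = np.array([10, 20, 30])
labeled = pd.Series(values, index=['a', 'b', 'c'])

by_label = labeled['b']        # label-based lookup, like a dictionary key
by_position = labeled.iloc[1]  # position-based lookup, like an array offset
from_array = values[1]         # a plain ndarray only supports positions

print(by_label, by_position, from_array)
```

All three lookups return the same element; only the Series offers the label-based form.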

### Creating a Series

You can convert a list, a NumPy array, or a dictionary to a Series:

In [None]:
labels = ['a', 'b', 'c']
my_list = [10, 20, 30]
arr = np.array([10, 20, 30])
d = {'a': 10,'b': 20,'c': 30}

**Using Lists**

In [None]:
pd.Series(data = my_list)

In [None]:
pd.Series(data = my_list,
          index = labels)

In [None]:
pd.Series(my_list, labels)

**NumPy Arrays**

In [None]:
pd.Series(arr)

In [None]:
pd.Series(arr, labels)

**Dictionary**

In [None]:
pd.Series(d)

### Data in a Series

A pandas Series can hold a variety of object types:

In [None]:
pd.Series(data = labels)

In [None]:
# Even functions (although unlikely that you will use this)
pd.Series([sum, print, len])

## Using an Index

The key to using a Series is understanding its index. Pandas uses these index labels (names or numbers) for fast lookups, much like a hash table or dictionary.

Let's see some examples of how to grab information from a Series. Let us create two Series, ser1 and ser2:

In [None]:
ser1 = pd.Series([1, 2, 3, 4], 
                 index = ['USA', 'Germany', 'USSR', 'Japan'])                                   

In [None]:
ser1

In [None]:
ser2 = pd.Series([1, 2, 5, 4], 
                 index = ['USA', 'Germany', 'Italy', 'Japan'])                                   

In [None]:
ser2

In [None]:
ser1['USA']

Operations are also aligned on the index: values with matching labels are combined, and labels present in only one Series produce NaN:

In [None]:
ser1 + ser2
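
The sketch below repeats the addition and then shows one possible way to avoid the NaN entries, assuming that treating a missing label as 0 is acceptable for your data. It uses `Series.add` with `fill_value`; the names `total` and `filled` are illustrative.

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'USSR', 'Japan'])
ser2 = pd.Series([1, 2, 5, 4], index=['USA', 'Germany', 'Italy', 'Japan'])

# Plain addition: labels found in only one Series become NaN
total = ser1 + ser2

# Assumption: a missing label should count as 0 instead of producing NaN
filled = ser1.add(ser2, fill_value=0)

print(total)
print(filled)
```

Note that the result holds floats rather than integers, because NaN is a floating-point value.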

Let's stop here for now and move on to DataFrames, which expand on the concept of a Series!

## Great Job!

The cells below contain extra material on Series. To continue with the main course, move on to the **Pandas DataFrame** section.


In [None]:
# Example of creating Pandas series :
s1 = pd.Series( np.random.randn(5) )
print(s1)

We did not pass an index, so pandas assigned a default integer index ranging from 0 to len(data)-1.

In [None]:
# View index values
print(s1.index)

In [None]:
# Creating Pandas series with index:
s2 = pd.Series( np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'] )
print(s2)

In [None]:
# View index values
print(s2.index)

In [None]:
# Create a Series from dictionary
data = {'pi': 3.1415, 'e': 2.71828}  # dictionary
print(data)
s3 = pd.Series ( data )
print(s3)

In [None]:
# reordering the elements
s4 = pd.Series ( data, index = ['e', 'pi', 'tau'])
print(s4)

NaN (not a number) is used to mark a missing value in pandas. Here 'tau' is not a key in the dictionary, so its value is NaN.
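
A short sketch of how missing values can be detected and handled, using the same dictionary as above. Whether to drop or fill missing entries depends on the analysis; filling with 0.0 here is only an assumption for illustration.

```python
import pandas as pd

data = {'pi': 3.1415, 'e': 2.71828}
s = pd.Series(data, index=['e', 'pi', 'tau'])  # 'tau' has no value

missing_mask = s.isna()   # True where the value is missing
cleaned = s.dropna()      # drop the missing entries
filled = s.fillna(0.0)    # or replace them with a chosen default

print(missing_mask)
print(cleaned)
print(filled)
```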

In [None]:
# Creating a Pandas Series object from a single number:
s5 = pd.Series( 1, index = range(10), name='Ones')
print(s5)

In [None]:
s1

In [None]:
# Many ways to "slice" Pandas series (series have zero-based index by default):
print(s1)
s1[3]  # returns 4th element

In [None]:
s1[:2] # First 2 elements


In [None]:
print( s1[ [2,1,0]])  # Elements out of order

In [None]:
# Select by index label (a Series can be accessed like a dictionary)

s4['pi']

In [None]:
dir(s4)

In [None]:
# Series provide ndarray-style statistical methods:
print("Median:" , s4.median())

In [None]:
s1[s1 > 0]

In [None]:
# Boolean masks can use Series methods: keep the values above the median
s4[s4 > s4.median()]

In [None]:
# vector operations:
np.exp(s1)

In [None]:
# Unlike an ndarray, Series automatically align data based on label:
s5 = pd.Series (range(6))
print(s5)
s5[0:] + s5[::-1]
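
To make the contrast with NumPy explicit, the following sketch performs the same reversed addition on an ndarray and on a Series. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

arr = np.arange(6)
ser = pd.Series(arr)

# ndarray: purely positional, first element + last element, and so on
array_sum = arr + arr[::-1]

# Series: aligned by label, so each label is added to itself
series_sum = ser + ser[::-1]

print(array_sum)
print(series_sum)
```

Reversing a Series changes the order of its rows but not their labels, which is why the two results differ.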

#### Popular Attributes and Methods:

|  Attribute/Method | Description |
|-----|-----|
| dtype | data type of values in series |
| empty | True if series is empty |
| size | number of elements |
| values | Returns values as ndarray |
| head() | First n elements |
| tail() | Last n elements |
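
As a quick illustration of the table above, this sketch calls each attribute and method on a small, made-up Series.

```python
import pandas as pd

s = pd.Series([1.5, 2.5, 3.5, 4.5], index=['a', 'b', 'c', 'd'])

print(s.dtype)    # data type of the values
print(s.empty)    # whether the Series has no elements
print(s.size)     # number of elements
print(s.values)   # the values as a NumPy ndarray
print(s.head(2))  # first 2 elements
print(s.tail(2))  # last 2 elements
```

Without an argument, `head()` and `tail()` return 5 elements.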

*Exercise* 

In [None]:
# Create a series of your choice and explore it
# <your code goes here >
mys = pd.Series( np.random.randn(21))
print(mys)

In [None]:
mys.head()

In [None]:
mys.empty

# Pandas DataFrame

A pandas *DataFrame* is a two-dimensional, size-mutable, heterogeneous tabular data structure with labeled rows and columns (axes). It can be thought of as a dictionary-like container for Series objects. <br>
DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index. Let's use pandas to explore this topic!
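
The idea of "Series sharing one index" can be sketched by building a DataFrame from a dictionary of Series. The column names `ints` and `floats` are hypothetical and chosen only for this example.

```python
import pandas as pd

ints = pd.Series({'A': 1, 'B': 2, 'C': 3})
floats = pd.Series({'A': 10.0, 'B': 20.0, 'C': 30.0})

# Each dictionary key becomes a column; the Series are aligned on their index
frame = pd.DataFrame({'ints': ints, 'floats': floats})

print(frame)
print(type(frame['floats']))  # each column is itself a Series
```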


In [None]:
import pandas as pd
import numpy as np

In [None]:
from numpy.random import randn
np.random.seed(101)

In [None]:
df = pd.DataFrame(randn(5, 4),
                  index = 'A B C D E'.split(),
                  columns = 'W X Y Z'.split())

In [None]:
df

## Selection and Indexing

Let's learn the various methods to grab data from a DataFrame:

In [None]:
df['W']

In [None]:
# Pass a list of column names
df[['W', 'Z']]

In [None]:
# Attribute access (NOT RECOMMENDED: fails for column names that clash with DataFrame methods or are not valid identifiers)
df.W

DataFrame columns are just Series:

In [None]:
type(df['W'])

**Creating a new column:**

In [None]:
df['new'] = df['W'] + df['Y']

In [None]:
df

**Removing Columns**

In [None]:
df.drop('new',
        axis = 1)

In [None]:
# drop() returns a new DataFrame; the original is unchanged unless inplace=True
df

In [None]:
df.drop('new',
        axis = 1,
        inplace = True)

In [None]:
df
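
An alternative to `inplace = True` is to keep the DataFrame that `drop()` returns. The sketch below uses a small, made-up frame `df_demo` and the `columns=` keyword, which is equivalent to passing `axis = 1`.

```python
import pandas as pd

df_demo = pd.DataFrame({'W': [1, 2], 'X': [3, 4], 'new': [5, 6]})

# drop() leaves df_demo untouched and returns a modified copy
dropped = df_demo.drop(columns='new')

print(df_demo.columns.tolist())
print(dropped.columns.tolist())
```

Assigning the result to a name makes it explicit which object carries the change.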

You can also drop rows this way:

In [None]:
df.drop('E',
        axis = 0)

**Selecting Rows**

In [None]:
df.loc['A']

Or select based on position instead of label:

In [None]:
df.iloc[2]

**Selecting subset of rows and columns**

In [None]:
df.loc['B', 'Y']

In [None]:
df.loc[['A', 'B'],
       ['W', 'Y']]
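
The difference between label-based `.loc` and position-based `.iloc` is easiest to see side by side. This is a minimal sketch on a made-up frame `df_demo`; note how the two handle the end of a slice differently.

```python
import pandas as pd

df_demo = pd.DataFrame({'W': [1, 2, 3], 'Y': [4, 5, 6]},
                       index=['A', 'B', 'C'])

by_label = df_demo.loc['B', 'Y']    # row label 'B', column label 'Y'
by_position = df_demo.iloc[1, 1]    # second row, second column

label_slice = df_demo.loc['A':'B']  # label slices include the end label
position_slice = df_demo.iloc[0:1]  # position slices exclude the stop

print(by_label, by_position)
print(label_slice)
print(position_slice)
```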

### Conditional Selection

An important feature of pandas is conditional selection using bracket notation, very similar to numpy:

In [None]:
df

In [None]:
df > 0

In [None]:
df[df > 0]

In [None]:
df[df['W'] > 0]

In [None]:
df[df['W'] > 0]['Y']

In [None]:
df[df['W'] > 0][['Y', 'X']]

To combine two conditions, use | (or) and & (and), wrapping each condition in parentheses:

In [None]:
df[(df['W'] > 0) & (df['Y'] > 1)]
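
Python's `and` and `or` keywords do not work element-wise on a Series, which is why the operators `&` and `|` are needed. The sketch below, on a made-up frame `df_demo`, shows both operators and also a single `.loc` call as one possible alternative to chaining two bracket selections.

```python
import pandas as pd

df_demo = pd.DataFrame({'W': [1.0, -1.0, 2.0], 'Y': [2.0, 3.0, 0.5]},
                       index=['A', 'B', 'C'])

# Rows where both conditions hold
both = df_demo[(df_demo['W'] > 0) & (df_demo['Y'] > 1)]

# Rows where at least one condition holds
either = df_demo[(df_demo['W'] > 0) | (df_demo['Y'] > 1)]

# Row filter and column selection in one step
selected = df_demo.loc[df_demo['W'] > 0, 'Y']

print(both)
print(either)
print(selected)
```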

## More Index Details

Let's discuss some more features of indexing, including resetting the index or setting it to something else. We'll also talk about index hierarchy!

In [None]:
df

In [None]:
# Reset to default 0,1...n index
df.reset_index()

In [None]:
newind = 'CA NY WY OR CO'.split()

In [None]:
df['States'] = newind

In [None]:
df

In [None]:
df.set_index('States')

In [None]:
df

In [None]:
df.set_index('States',
             inplace = True)

In [None]:
df
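
As a compact recap, `set_index` and `reset_index` are roughly inverse operations. The sketch below uses a small, made-up frame `df_demo` and keeps the returned objects instead of using `inplace`.

```python
import pandas as pd

df_demo = pd.DataFrame({'W': [1, 2], 'States': ['CA', 'NY']},
                       index=['A', 'B'])

# Move the 'States' column into the index (the old index is discarded)
by_state = df_demo.set_index('States')

# Move the index back into a column and restore a default 0..n-1 index
restored = by_state.reset_index()

print(by_state)
print(restored)
```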

## Extra: Multi-Index and Index Hierarchy

Let us go over how to work with a Multi-Index. First, we'll create a quick example of what a Multi-Indexed DataFrame looks like:

In [None]:
# Index Levels
outside = ['G1', 'G1', 'G1', 'G2', 'G2', 'G2']
inside = [1, 2 , 3, 1, 2, 3]
hier_index = list(zip(outside,inside))
hier_index = pd.MultiIndex.from_tuples(hier_index)

In [None]:
hier_index

In [None]:
df = pd.DataFrame(np.random.randn(6, 2),
                  index=hier_index,columns = ['A', 'B'])
df

Now let's show how to index this! For an index hierarchy on the rows we use df.loc[]; if the hierarchy were on the columns axis, you would just use normal bracket notation df[]. Selecting one level of the index returns the sub-DataFrame:

In [None]:
df.loc['G1']

In [None]:
df.loc['G1'].loc[1]

In [None]:
df.index.names

In [None]:
df.index.names = ['Group', 'Num']

In [None]:
df

In [None]:
df.xs('G1')

In [None]:
# Use a tuple for a multi-level key; recent pandas versions reject a list here
df.xs(('G1', 1))

In [None]:
df.xs(1,
      level = 'Num')
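
The following self-contained sketch summarises the two selection styles on a Multi-Index. It builds a frame with deterministic values (via `np.arange`, rather than random numbers) so that the results are easy to follow; `df_demo` and the other names are illustrative.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]],
                                   names=['Group', 'Num'])
df_demo = pd.DataFrame(np.arange(12).reshape(6, 2),
                       index=index, columns=['A', 'B'])

# A tuple selects one row by its full (outer, inner) key
row = df_demo.loc[('G1', 2)]

# xs() with level= takes a cross-section on an inner level across all groups
cross = df_demo.xs(2, level='Num')

print(row)
print(cross)
```

`xs` is convenient when the level you want to filter on is not the outermost one.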

# Great Job!

### Exercises

In [None]:
#Display the first few records
df.head()

---
*Exercise*

In [None]:
#Display first 10 records
# <your code goes here>

In [None]:
#Display first 20 records
# <your code goes here>

In [None]:
#Display the last 5 records
df.tail()
# <your code goes here>

---

In [None]:
#Identify the type of df object
type(df)

In [None]:
#Check the type of column "A"
df['A'].dtype

In [None]:
#List the types of all columns
df.dtypes

In [None]:
#List the column names
df.columns

In [None]:
#List the row labels and the column names
df.axes

In [None]:
#Number of dimensions
df.ndim

In [None]:
#Total number of elements in the Data Frame
df.size

In [None]:
#Number of rows and columns
df.shape

In [None]:
#Output basic statistics for the numeric columns
df.describe()

In [None]:
#Calculate mean for all numeric columns
df.mean()

---
*Exercise*

In [None]:
#Calculate the standard deviation (std() method) for all numeric columns
# <your code goes here>
df.std()

In [None]:
#Calculate the mean of each column over the first 50 rows
# <your code goes here>

---
### Common Aggregation Functions:

| Function | Description |
|-----|-----|
| min | minimum |
| max | maximum |
| count | number of non-null observations |
| sum | sum of values |
| mean | arithmetic mean of values |
| median | median |
| mad | mean absolute deviation (removed in pandas 2.0) |
| mode | mode |
| prod | product of values |
| std | standard deviation |
| var | unbiased variance |
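
Several of these aggregations can be computed at once with `DataFrame.agg`. The sketch below uses a small, made-up frame; because the `mad` method is not available in pandas 2.0 and later, it also shows one way to compute the mean absolute deviation by hand.

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0],
                        'B': [10.0, 20.0, 30.0, 40.0]})

# One row per aggregation, one column per DataFrame column
summary = df_demo.agg(['min', 'max', 'mean', 'std'])

# Mean absolute deviation of column 'A', computed manually
mad_a = (df_demo['A'] - df_demo['A'].mean()).abs().mean()

print(summary)
print(mad_a)
```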



### Basic descriptive statistics

| Function | Description |
|-----|-----|
| min | minimum |
| max | maximum |
| mean | arithmetic mean of values |
| median | median |
| mad | mean absolute deviation (removed in pandas 2.0) |
| mode | mode |
| std | standard deviation |
| var | unbiased variance |
| sem | standard error of the mean |
| skew | sample skewness |
| kurt | kurtosis |
| quantile | value at a given quantile (e.g. 0.5 for the median) |
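
A brief sketch of a few of the less familiar statistics on a small, made-up Series. The data are symmetric around 3, so the skewness comes out as (numerically) zero.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

median = s.quantile(0.5)               # same value as s.median()
quartiles = s.quantile([0.25, 0.75])   # several quantiles at once
sem = s.sem()                          # std / sqrt(number of values)
skew = s.skew()                        # close to 0 for symmetric data

print(median)
print(quartiles)
print(sem, skew)
```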


<!--NAVIGATION-->
< [Previous](0.Python_for_Data_Analysis.ipynb) | [Contents](Index.ipynb) | [Next](2.Groupby.ipynb) >