<div class="alert alert-block alert-info">

# **PANDAS** IN 30 MIN

<div class="alert alert-block alert-success">

#### Author of pandas - **Wes McKinney**

## Presented by Manoj

* ###### Developer Wes McKinney started working on pandas in 2008 while at AQR Capital Management, out of the need for a high-performance, flexible tool for quantitative analysis of financial data. Before leaving AQR, he convinced management to let him open-source the library.
* ###### Another AQR employee, Chang She, joined the effort in 2012 as the second major contributor to the library.
* ###### In 2015, pandas signed on as a fiscally sponsored project of NumFOCUS, a 501(c)(3) nonprofit charity in the United States.

## <ins>**INTRODUCTION TO PANDAS** </ins> :

### Key features of pandas
1. DataFrame object for data manipulation with integrated indexing.
2. Tools for reading and writing data between in-memory data structures and different file formats.
3. Data alignment and integrated handling of missing data.
4. Reshaping and pivoting of data sets.
5. Label-based slicing, fancy indexing, and subsetting of large data sets.
6. Data structure column insertion and deletion.
7. Group by engine allowing split-apply-combine operations on data sets.
8. Data set merging and joining.
9. Hierarchical axis indexing to work with high-dimensional data in a lower-dimensional data structure.
10. Time-series functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging.
11. Data filtering.
###### The library is highly optimized for performance, with critical code paths written in Cython or C.
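
The time-series features in item 10 are not exercised later in this notebook, so here is a minimal sketch of three of them. The names `idx`, `ts`, `rolling_mean`, `lagged`, and `five_day` are illustrative only, and the 3-day window and 5-day bins are arbitrary choices.

```python
import numpy as np
import pandas as pd

# Ten consecutive daily observations with the values 0.0 .. 9.0 (made-up data).
idx = pd.date_range("2013-01-01", periods=10, freq="D")
ts = pd.Series(np.arange(10, dtype=float), index=idx)

# Moving window statistic: 3-day rolling mean.
# The first two entries are NaN because the window is not yet full.
rolling_mean = ts.rolling(window=3).mean()

# Date shifting / lagging: each value moves one position later.
lagged = ts.shift(1)

# Frequency conversion: downsample the daily data into 5-day bins.
five_day = ts.resample("5D").mean()

print(rolling_mean.head())
print(lagged.head())
print(five_day)
```

`rolling`, `shift`, and `resample` all return new objects, so `ts` itself is left unchanged.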

<div class="alert alert-block alert-success">
    
### Here we work through some problems using pandas.

#### <div class="alert alert-success"><font color=green>Import pandas and numpy</font>

In [1]:
import pandas as pd
import numpy as np

### <div class="alert alert-success"><font color=blue>Object Creation</font>
**Creating a Series by passing a list of values, letting pandas create a default integer index:**

In [8]:
number_series = pd.Series([1, 3, 5, np.nan, 6, 8])
print(number_series)

###### Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns:

In [127]:
dates = pd.date_range('20130101', periods=6)
#print(dates)

In [128]:
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
#print(df)

###### Creating a DataFrame by passing a dict of objects that can be converted to Series-like structures:

In [129]:
df2 = pd.DataFrame({"A": pd.Timestamp('20130301'),
                 "B": ['Hyderabad','Chennai','Bangalore','Mumbai']})
#print(df2)

### <div class="alert alert-success"> <font color=blue>Viewing data</font>

In [130]:
df.head()


In [131]:
df.tail()

In [132]:
df.index

In [133]:
df.columns

In [134]:
df.describe()

In [135]:
df.T

In [136]:
df.sort_index(axis=1, ascending=False)

In [137]:
df.sort_values(by='B')

### <div class="alert alert-success"><font color=blue>Selection</font>

In [138]:
df['A']

In [139]:
df[0:3]

###### <font color=red>Selection by label</font> 

In [141]:
df.loc[dates[0]]

In [140]:
df.loc[:, ['A', 'B']]

In [142]:
df.at[dates[0], 'A']

###### <font color=red>Selection by position</font>


In [143]:
df.iloc[3]

In [144]:
df.iloc[3:5, 0:2]

In [145]:
df.iloc[[1, 2, 4], [0, 2]]

In [146]:
df.iloc[1,1]

###### <font color=red>Boolean indexing</font>

In [147]:
df[df.A > 0]

In [148]:
df[df > 0]

###### Using the isin() method for filtering:

In [149]:
df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
#print(df2)

In [150]:
df2[df2['E'].isin(['two', 'four'])]
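
The boolean-indexing cells above behave differently depending on the shape of the mask, which a small deterministic frame makes easier to see. This is only a sketch: `frame` and its values are made up for illustration and are not part of the notebook's data.

```python
import pandas as pd

# Small hand-written frame so the results are predictable.
frame = pd.DataFrame({'A': [-1.0, 2.0, 3.0],
                      'E': ['one', 'two', 'four']})

# A boolean Series as mask keeps only the rows where it is True.
positive_rows = frame[frame['A'] > 0]

# A boolean DataFrame as mask keeps the shape and puts NaN where it is False.
numeric = frame[['A']]
masked = numeric[numeric > 0]

# Masks combine with & (and) and | (or); each condition needs its own parentheses.
both = frame[(frame['A'] > 2) & frame['E'].isin(['two', 'four'])]

print(positive_rows)
print(masked)
print(both)
```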

### <div class="alert alert-success"> <font color=blue>Missing data</font>
**pandas primarily uses the value `np.nan` to represent missing data. By default, missing values are excluded from computations. See the Missing Data section of the pandas documentation.**

In [151]:
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1], 'E'] = 1
#print(df1)

In [152]:
df1.dropna()

In [153]:
df1.fillna(value=5)
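
To illustrate what "excluded from computations" means in practice, here is a minimal sketch on a three-element Series. The variable names are illustrative, and `skipna=False` is shown only to contrast with the default behaviour.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0])

# Default: the NaN is skipped, so the mean is (1 + 3) / 2.
mean_skipping_nan = values.mean()

# With skipna=False the NaN propagates into the result.
mean_with_nan = values.mean(skipna=False)

# isna() gives a boolean mask marking the missing entries.
missing_mask = values.isna()

print(mean_skipping_nan)
print(mean_with_nan)
print(missing_mask)
```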

### <div class="alert alert-success"> <font color=blue>Operations</font>

In [154]:
df.mean()

### <div class="alert alert-success"> <font color=blue>Apply function</font>

In [155]:
df1.apply(np.cumsum)

###### String Methods

In [156]:
names = pd.Series(['Manoj', 'Srinivas', 'karthik', 'RaMEsh', 'SUREsh', np.nan, 'lion', 'dog', 'cat'])
#names.str.upper()
#names.str.lower()
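
The `.str` accessor applies a string method to every element and leaves missing values as missing. A short sketch on a reduced version of the list above (the variable names here are illustrative):

```python
import numpy as np
import pandas as pd

short_names = pd.Series(['Manoj', 'RaMEsh', np.nan, 'dog'])

upper = short_names.str.upper()    # every string upper-cased, missing value kept
lower = short_names.str.lower()    # every string lower-cased, missing value kept
lengths = short_names.str.len()    # length of each string, missing value kept

print(upper)
print(lower)
print(lengths)
```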

###  <div class="alert alert-success"><font color=blue>Merge</font> 
<font color=red>**Concat**</font>

In [97]:
df = pd.DataFrame(np.random.randn(10, 4))

In [157]:
# Break it into pieces
pieces = [df[:3], df[3:7], df[7:]]
#pieces

In [158]:
pd.concat(pieces)

###### <font color=red>**Join**</font>

In [100]:
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})

right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

In [159]:
#print(right)
#print(left)

In [160]:
pd.merge(left, right, on='key')

In [103]:
left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})

In [161]:
pd.merge(left, right, on='key')
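
The two merges above return different numbers of rows because of how duplicate keys are handled: when a key appears several times on both sides, every left row is paired with every matching right row. The sketch below repeats both cases under separate, illustrative names so the results can be compared side by side.

```python
import pandas as pd

# Case 1: the key 'foo' appears twice on each side.
left_dup = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right_dup = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
dup_result = pd.merge(left_dup, right_dup, on='key')   # 2 x 2 = 4 rows

# Case 2: every key is unique, so each row matches exactly one partner.
left_uniq = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right_uniq = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
uniq_result = pd.merge(left_uniq, right_uniq, on='key')  # 2 rows

print(dup_result)
print(uniq_result)
```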

### <div class="alert alert-success"> <font color=blue>Append</font>

In [162]:
df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
#print(df)

In [163]:
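s = df.iloc[3]
# DataFrame.append was removed in pandas 2.0; concatenate the row as a one-row frame instead
pd.concat([df, s.to_frame().T], ignore_index=True)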

### <div class="alert alert-success"> <font color=blue>Grouping</font>
**By “group by” we are referring to a process involving one or more of the following steps:**

* Splitting the data into groups based on some criteria
* Applying a function to each group independently
* Combining the results into a data structure

In [164]:
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                      'B': ['one', 'one', 'two', 'three',
                             'two', 'two', 'one', 'three'],
                       'C': np.random.randn(8),
                       'D': np.random.randn(8)})
#print(df)

In [165]:
df.groupby('A').sum()

In [166]:
df.groupby(['A', 'B']).sum()
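
The three steps listed at the start of this section can also be written out by hand, which may help show what `groupby(...).sum()` does internally. This is a sketch on made-up values, not how pandas implements it; `small`, `groups`, `sums`, `manual`, and `builtin` are illustrative names.

```python
import pandas as pd

small = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                      'C': [1.0, 2.0, 3.0, 4.0]})

# 1. Split: one sub-frame per distinct value of column 'A'.
groups = {label: part for label, part in small.groupby('A')}

# 2. Apply: run a function on each group independently.
sums = {label: part['C'].sum() for label, part in groups.items()}

# 3. Combine: collect the per-group results into one Series.
manual = pd.Series(sums, name='C').sort_index()

# The built-in one-liner produces the same numbers.
builtin = small.groupby('A')['C'].sum()

print(manual)
print(builtin)
```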

###  <div class="alert alert-success"> <font color=blue>Pivot tables</font>

In [167]:
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
                   'B': ['A', 'B', 'C'] * 4,
                   'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D': np.random.randn(12),
                   'E': np.random.randn(12)})
#print(df)   

In [168]:
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])

### <div class="alert alert-success"><font color=blue>Plotting</font>

In [169]:
ts = pd.Series(np.random.randn(1000),
                   index=pd.date_range('1/1/2000', periods=1000))
#print(ts)

In [118]:
ts = ts.cumsum()
#print(ts)

In [170]:
ts.plot()

In [171]:
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
                     columns=['A', 'B', 'C', 'D'])
#print(df)    
df = df.cumsum()
df.plot()

### <div class="alert alert-success"><font color=blue>Getting data in/out</font>
<font color=red>**CSV**</font>

In [123]:
df.to_csv('foo.csv')

In [172]:
pd.read_csv('foo.csv')

###### <font color=red>**HDF5**</font>

* ***HDF5 for Python***
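* The h5py package is a Pythonic interface to the HDF5 binary data format.
* pandas itself reads and writes HDF5 through the PyTables package (`tables`), which must be installed for the cells below to run.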

HDF5 lets you store huge amounts of numerical data, and easily manipulate that data from NumPy. For example, you can slice into multi-terabyte datasets stored on disk, as if they were real NumPy arrays. Thousands of datasets can be stored in a single file, categorized and tagged however you want.

In [125]:
df.to_hdf('foo.h5', key='df')

In [173]:
pd.read_hdf('foo.h5', key='df')