# <center>Introduction to Pandas</center>

![](https://pandas.pydata.org/_static/pandas_logo.png)

>pandas is an open source, BSD-licensed library providing **high-performance**, **easy-to-use data structures** and **data analysis tools** for the Python programming language. Pandas stands for “Python Data Analysis Library”. According to the Wikipedia page on Pandas, “the name is derived from the term ‘panel data’, an econometrics term for multidimensional structured data sets.”

## Why use Pandas?

What’s cool about Pandas is that it takes data (like a CSV or TSV file, or a SQL database) and creates a Python object with rows and columns called a **DataFrame**, which looks very similar to a table in spreadsheet or statistical software (think Excel). This is much easier than working with lists and/or dictionaries through for loops or list comprehensions!

So, in a way:

# <center>ARRAY + TABLE = PANDAS</center>

![](https://memegenerator.net/img/instances/500x/51740682/you-are-a-python-lover-if-pandas-doesnt-mean-but-httppandaspydataorg.jpg)

## Installation

Simply,
```
pip install pandas
```

## Reading data from a CSV file

You can read data from a CSV file using the ``read_csv`` function. By default, it assumes that the fields are comma-separated.

In [None]:
import pandas as pd

In [None]:
imdb_df = pd.read_csv("data/imdb_1000.csv")

In [None]:
imdb_df.head()

In [None]:
bikes_df = pd.read_csv("data/bikes.csv")

In [None]:
bikes_df.head()

In [None]:
bikes_df = pd.read_csv("data/bikes.csv", sep=';', parse_dates=['Date'], dayfirst=True, index_col='Date')

In [None]:
bikes_df.head()

In [None]:
bikes_df.columns

In [None]:
bikes_df.dtypes
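
The extra arguments passed to ``read_csv`` above each fix one problem with the raw file. The sketch below shows what they do on a tiny in-memory sample; the sample data and the ``Other`` column are made up for illustration and only stand in for a file such as ``data/bikes.csv``.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated sample with day-first dates.
raw = io.StringIO(
    "Date;Berri1;Other\n"
    "01/02/2012;35;20\n"
    "02/02/2012;83;44\n"
)

sample_df = pd.read_csv(
    raw,
    sep=";",               # fields are separated by semicolons, not commas
    parse_dates=["Date"],  # convert the Date column to datetime values
    dayfirst=True,         # read 01/02/2012 as 1 February, not 2 January
    index_col="Date",      # use the parsed dates as the row index
)

print(sample_df.index)
print(sample_df.dtypes)
```

Without ``sep=';'`` each row would be read as a single column, which is why the first ``bikes_df.head()`` above looks wrong.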

## Creating a DataFrame

A list of lists/tuples can be used to create a DataFrame.

In [None]:
# The initial set of baby names and birth counts
names = ['Bob','Jessica','Mary','John','Mel']
births = [968, 155, 77, 578, 973]

In [None]:
BabyDataSet = list(zip(names,births))

In [None]:
df = pd.DataFrame(data = BabyDataSet, columns=['Names', 'Births'])

In [None]:
df.head()

In [None]:
# saving dataframe as csv file
df.to_csv('data/births.csv')
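
By default ``to_csv`` also writes the row index as an unnamed first column, which shows up as an ``Unnamed: 0`` column when the file is read back. A minimal sketch of the difference, using a small made-up frame and in-memory text instead of a file:

```python
import io

import pandas as pd

df_small = pd.DataFrame({"Names": ["Bob", "Mary"], "Births": [968, 77]})

# With no path given, to_csv returns the CSV text as a string.
with_index = df_small.to_csv()                # header line is ",Names,Births"
without_index = df_small.to_csv(index=False)  # header line is "Names,Births"

# Reading the index-free text back gives the original two columns.
round_trip = pd.read_csv(io.StringIO(without_index))
print(round_trip)
```

Passing ``index=False`` is usually what you want when the index is just the default 0, 1, 2, ... row numbers.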

## Selecting columns

When you read a CSV, you get a kind of object called a DataFrame, which is made up of rows and columns. You get columns out of a DataFrame the same way you get elements out of a dictionary.

In [None]:
imdb_df.columns

In [None]:
imdb_df['title']

In [None]:
type(imdb_df['title'])

In [None]:
bikes_df['Berri1']

In [None]:
# selecting multiple columns at once
imdb_df[['star_rating', 'genre']]
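
The number of brackets decides what you get back: a single column name returns a Series, while a list of names returns a DataFrame. A small sketch with a made-up frame (the rows are placeholders, not the real IMDB data):

```python
import pandas as pd

movies = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "genre": ["Crime", "Drama"],
    "star_rating": [9.3, 8.9],
})

one_column = movies["title"]                    # a single name -> Series
two_columns = movies[["star_rating", "genre"]]  # a list of names -> DataFrame
one_column_frame = movies[["title"]]            # a one-item list -> still a DataFrame

print(type(one_column).__name__)        # Series
print(type(two_columns).__name__)       # DataFrame
print(type(one_column_frame).__name__)  # DataFrame
```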

## Understanding columns

Each column of a DataFrame is a ``pd.Series``, and for most data types a Series stores its values in a NumPy array. If you add ``.values`` to the end of any Series, you'll get that data as a **numpy array**.

In [None]:
durations = imdb_df['duration']

In [None]:
type(durations)

In [None]:
type(durations.values)

In [None]:
duration_arr = durations.values

In [None]:
duration_arr[:5]
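
The practical difference is that a Series carries an index along with its data, while the NumPy array is the bare data. The sketch below uses a made-up Series; it also uses ``.to_numpy()``, which the pandas documentation recommends over ``.values`` in newer code.

```python
import numpy as np
import pandas as pd

# Hypothetical durations with string labels as the index.
durations_demo = pd.Series([142, 175, 200], index=["a", "b", "c"], name="duration")

arr = durations_demo.to_numpy()  # same data as .values for this numeric Series

print(type(arr))            # <class 'numpy.ndarray'>
print(durations_demo["b"])  # label-based lookup on the Series -> 175
print(arr[1])               # position-based lookup on the array -> 175
```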

## Applying functions to columns

In [None]:
capitalizer = lambda x: x.upper()
imdb_df['title'].apply(capitalizer)
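
``apply`` calls the function once per element and returns a new Series; the original column is left unchanged. For common string operations pandas also offers the ``.str`` accessor, which gives the same result without writing a lambda. A short sketch on made-up titles:

```python
import pandas as pd

titles = pd.Series(["The Godfather", "Pulp Fiction"])

via_apply = titles.apply(lambda x: x.upper())  # element-by-element function call
via_str = titles.str.upper()                   # built-in string method

print(via_apply.tolist())  # ['THE GODFATHER', 'PULP FICTION']
print(titles.tolist())     # the original Series is unchanged
```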

## Plotting a column

Use the ``.plot()`` method!

In [None]:
%matplotlib inline

In [None]:
bikes_df['Berri1'].plot()

We can also plot all the columns just as easily. 

In [None]:
bikes_df.plot(figsize=(15, 10))

## Index

### DATAFRAME = COLUMNS + INDEX + 2-D DATA

### SERIES = INDEX + 1-D DATA

The **Index** (the **row labels**) is one of the fundamental data structures of pandas. It can be thought of as an **immutable array** and an **ordered set**.

> Every row is identified by its index value, and lookups are simplest when those values are unique.

In [None]:
bikes_df.index

In [None]:
# get row for date 2012-01-01
bikes_df.loc['2012-01-01']

#### To get a row by integer position:

Use ``.iloc[]`` for purely integer-location based indexing for selection by position.

In [None]:
bikes_df.iloc[0]
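
To make the difference concrete: ``.loc`` looks a row up by its index label, while ``.iloc`` looks it up by its position, counting from 0. A minimal sketch on a made-up three-day frame (the counts are placeholder values):

```python
import pandas as pd

dates = pd.to_datetime(["2012-01-01", "2012-01-02", "2012-01-03"])
counts = pd.DataFrame({"Berri1": [35, 83, 135]}, index=dates)

by_label = counts.loc["2012-01-02"]  # row whose index label is 2 January 2012
by_position = counts.iloc[1]         # second row, whatever its label is

print(by_label["Berri1"])     # 83
print(by_position["Berri1"])  # 83
```

Both calls return the same row here only because 2 January happens to sit at position 1.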

## Slicing of dataframe

In [None]:
# fetch first 5 rows of dataframe
bikes_df[:5]

In [None]:
# fetch first 5 rows of a specific column
bikes_df['Berri1'][:5]

## Value counts

Get the number of times each unique value occurs in a particular column/Series.

In [None]:
imdb_df['genre'].value_counts()

In [None]:
# plotting value counts as a bar chart
imdb_df['genre'].value_counts().plot(kind='bar')
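
The result of ``value_counts()`` is itself a Series: the unique values become the index and the counts become the data, sorted from most to least frequent. A small sketch on a made-up list of genres:

```python
import pandas as pd

genres = pd.Series(["Crime", "Drama", "Crime", "Action", "Crime", "Drama"])

counts = genres.value_counts()               # counts per unique value
shares = genres.value_counts(normalize=True) # fractions instead of counts

print(counts["Crime"])  # 3
print(counts.index[0])  # most frequent value comes first -> Crime
print(shares["Crime"])  # 0.5
```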

## Selecting rows where a column has a particular value

In [None]:
imdb_df['genre'] == 'Adventure'

In [None]:
# select only those movies where genre is adventure
adventure_movies = imdb_df[imdb_df['genre'] == 'Adventure']

In [None]:
adventure_movies.head()

In [None]:
good_adventure_movies = imdb_df[(imdb_df['genre'] == 'Adventure') & (imdb_df['star_rating'] > 8.4)]

In [None]:
good_adventure_movies

In [None]:
# organised way
is_adventure = imdb_df['genre'] == 'Adventure'
has_high_rating = imdb_df['star_rating'] > 8.4
good_adventure_movies = imdb_df[is_adventure & has_high_rating]

# Just see title and duration of good adventure movies
good_adventure_movies[['title', 'duration']]

In [None]:
# which genre has the highest number of movies with a star rating of 8 or above?
has_above_8_rating = imdb_df['star_rating'] >= 8.0
good_movies = imdb_df[has_above_8_rating]
good_movies_genre_count = good_movies['genre'].value_counts()

In [None]:
good_movies_genre_count

In [None]:
good_movies_genre_count.idxmax()
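
All the filters in this section work the same way: a comparison on a column produces a Series of ``True``/``False`` values (a boolean mask), and indexing the DataFrame with that mask keeps only the ``True`` rows. The sketch below walks through this on a made-up four-row frame; the titles and ratings are placeholders.

```python
import pandas as pd

movies = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genre": ["Adventure", "Crime", "Adventure", "Crime"],
    "star_rating": [8.9, 9.2, 7.5, 8.1],
})

is_adventure = movies["genre"] == "Adventure"  # True, False, True, False
has_high_rating = movies["star_rating"] > 8.4  # True, True, False, False

# & is element-wise "and"; | is "or"; ~ negates a mask.
good_adventure = movies[is_adventure & has_high_rating]
not_adventure = movies[~is_adventure]

# idxmax returns the index label (here a genre) with the largest count.
top_genre = movies[movies["star_rating"] >= 8.0]["genre"].value_counts().idxmax()

print(good_adventure["title"].tolist())  # ['A']
print(top_genre)                         # Crime
```

When the conditions are written inline, each one needs its own parentheses, because ``&`` binds more tightly than ``==`` and ``>``.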

## Adding a new column to DataFrame

In [None]:
weekdays = bikes_df.index.weekday

In [None]:
weekdays

In [None]:
bikes_df['weekday'] = weekdays
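
The ``weekday`` attribute of a datetime index numbers the days from 0 (Monday) to 6 (Sunday). A minimal sketch on a made-up two-day frame, assuming the index has already been parsed as dates:

```python
import pandas as pd

dates = pd.to_datetime(["2012-01-01", "2012-01-02"])  # a Sunday and a Monday
frame = pd.DataFrame({"Berri1": [35, 83]}, index=dates)

# Assigning to a new column name adds the column to the DataFrame.
frame["weekday"] = frame.index.weekday

print(frame["weekday"].tolist())        # [6, 0]
print(frame.index.day_name().tolist())  # ['Sunday', 'Monday']
```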

## Deleting an existing column from DataFrame

In [None]:
bikes_df.head()

In [None]:
bikes_df.drop('Unnamed: 1', axis=1)

In [None]:
bikes_df.drop('Unnamed: 1', axis=1, inplace=True)

In [None]:
# deleting the columns at positions 1, 2, and 3 (positions are counted from 0)
bikes_df.drop(bikes_df.columns[[1,2,3]], axis=1)
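
Note that ``drop`` returns a new DataFrame and leaves the original untouched unless ``inplace=True`` is passed or the result is assigned back. A short sketch on a made-up frame:

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

without_b = frame.drop("b", axis=1)  # new frame; `frame` still has column b
print(list(frame.columns))           # ['a', 'b', 'c']
print(list(without_b.columns))       # ['a', 'c']

# Assigning the result back is an alternative to inplace=True.
frame = frame.drop(columns=["b"])

# Dropping by position: look the name up in frame.columns first.
without_first = frame.drop(frame.columns[[0]], axis=1)
print(list(without_first.columns))   # ['c']
```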

## Deleting a row in DataFrame

In [None]:
df

In [None]:
df.drop(df.index[0])

In [None]:
df.drop([0,1,2])

In [None]:
# drop movies with rating less than 9.0
has_poor_rating = imdb_df['star_rating'] < 9.0
imdb_df.drop(imdb_df[has_poor_rating].index)

## Group By

A groupby operation involves some combination of the following steps on the original object:

- Splitting the Object

- Applying a function

- Combining the results

In many situations, we split the data into sets and apply some function to each subset. In the apply step, we can perform the following operations:

- **Aggregation**: computing a summary statistic

- **Transformation**: performing some group-specific operation

- **Filtration**: discarding data based on some condition
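
The three kinds of operation differ mainly in the shape of what they return. The sketch below shows this on a made-up five-row frame; the genres and durations are placeholder values.

```python
import pandas as pd

movies = pd.DataFrame({
    "genre": ["Crime", "Crime", "Drama", "Drama", "Action"],
    "duration": [150, 170, 100, 120, 130],
})

durations_by_genre = movies.groupby("genre")["duration"]

# Aggregation: one value per group.
mean_per_genre = durations_by_genre.mean()

# Transformation: one value per original row (each row gets its group's mean).
group_mean_per_row = durations_by_genre.transform("mean")

# Filtration: keep only the rows of groups that pass the test.
long_genres = movies.groupby("genre").filter(lambda g: g["duration"].mean() > 120)

print(mean_per_genre["Crime"])        # 160.0
print(group_mean_per_row.tolist())    # [160.0, 160.0, 110.0, 110.0, 130.0]
print(long_genres["genre"].tolist())  # ['Crime', 'Crime', 'Action']
```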



In [None]:
weekday_groups = bikes_df.groupby('weekday')

In [None]:
# pass the aggregation by name; recent pandas versions warn when given the builtin sum
weekday_counts = weekday_groups.aggregate('sum')

In [None]:
weekday_counts

In [None]:
weekday_counts.index = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

In [None]:
weekday_counts

In [None]:
weekday_counts['Berri1'].plot(kind='bar')

Let us see one more example!

In [None]:
genre_groups = imdb_df.groupby('genre')

In [None]:
genre_groups.groups

In [None]:
# get crime movies group
genre_groups.get_group('Crime')

In [None]:
import numpy as np

In [None]:
# average only the numeric columns; text columns such as title cannot be averaged
averages = genre_groups.mean(numeric_only=True)

In [None]:
averages

In [None]:
# get sum, mean and std-dev of movie durations for each group
duration_analysis = genre_groups['duration'].aggregate(['sum', 'mean', 'std'])

In [None]:
duration_analysis

In [None]:
# change duration of all movies in a particular genre to mean duration of the group
averaged_movie_durations = genre_groups['duration'].transform(lambda x:x.mean())

In [None]:
# drop groups/genres that do not have average movie duration greater than 120.
genre_groups.filter(lambda x: x['duration'].mean() > 120)

## Exercises:

https://github.com/guipsamora/pandas_exercises/

Practice pandas using these exercises. Every exercise has 3 notebooks:
- Exercise
- Solutions
- Exercise with solutions

![](https://memegenerator.net/img/instances/500x/73988569/pythonpandas-is-easy-import-and-go.jpg)