# Pandas Beginner Tutorial

<img src="https://miro.medium.com/max/1080/1*3qZ_SHAVX6RAbRMHo4NCcA.jpeg" width ="500" height=500 >

**This tutorial covers the basic concepts and functions of Pandas, which we will use frequently in the coming days. The classic Iris dataset serves as the running example for working with a Pandas DataFrame.**


The pandas package is one of the most important tools available to data scientists and analysts working in Python today. Powerful machine learning and glamorous visualization tools may get all the attention, but pandas is the backbone of most data projects.

For example, say you want to explore a dataset stored in a CSV file on your computer. Pandas will load the data from that CSV into a DataFrame, which is essentially a table. Pandas is built on top of the NumPy package, so much of NumPy's structure is used or replicated in Pandas. Data in pandas is often used to feed statistical analysis in SciPy, plotting functions from Matplotlib, and machine learning algorithms in scikit-learn.

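The minimum Python version depends on the Pandas release. Python 3.5.3 or above was required when this tutorial was written; recent Pandas releases require a newer Python, so check the installation guide for the version you plan to install.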

To import Pandas (and NumPy, which we import alongside it), run the following code:

In [1]:
import pandas as pd 
import numpy as np 
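
A quick way to confirm that the import worked is to print the installed versions. This is a small sketch that uses only the standard `sys` module and the `pd.__version__` attribute; the variable names are chosen for illustration.

```python
import sys

import pandas as pd

# Report the interpreter and pandas versions in use.
python_version = sys.version.split()[0]
pandas_version = pd.__version__

print("Python:", python_version)
print("pandas:", pandas_version)
```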

## Start from a small dataset with Pandas

Load hard-coded data into a DataFrame:

In [34]:
data = {
    'apples': [3, 2, 0, 1], 
    'oranges': [0, 3, 7, 2]
}
print(data)

{'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]}


In [6]:
purchases = pd.DataFrame(data)
purchases

,apples,oranges
0,3,0
1,2,3
2,0,7
3,1,2


Let's have customer names as our index:

In [7]:
purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])

purchases

,apples,oranges
June,3,0
Robert,2,3
Lily,0,7
David,1,2


Now we can look up a customer's order by name:

In [8]:
purchases.loc['June']

apples     3
oranges    0
Name: June, dtype: int64
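
Beyond a whole row, `.loc` can also return a single cell or several rows at once. The sketch below rebuilds the same `purchases` table so it runs on its own; the names `june_apples` and `subset` are only illustrative.

```python
import pandas as pd

purchases = pd.DataFrame(
    {"apples": [3, 2, 0, 1], "oranges": [0, 3, 7, 2]},
    index=["June", "Robert", "Lily", "David"],
)

# A single cell: row label first, then column label.
june_apples = purchases.loc["June", "apples"]

# Several customers at once: pass a list of row labels.
subset = purchases.loc[["June", "Lily"]]

print(june_apples)
print(subset)
```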

### How to read in data

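Pandas can read many file formats into a DataFrame: pd.read_csv handles comma-separated values (CSV) files, and pd.read_excel handles Excel workbooks such as the one used here.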
##### Read from file

In [None]:
data = pd.read_excel('names.xlsx')

Use index_col to choose which column supplies the row labels. With index_col=0, the first column of the file becomes the index, replacing the default 0, 1, 2, ... row labels.

In [None]:
data = pd.read_excel('names.xlsx', index_col=0)
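
To see the effect of index_col without needing a file on disk, here is a minimal sketch using pd.read_csv, which accepts the same index_col parameter. The in-memory text and its columns (`id`, `first_name`, `last_name`) are made up for illustration and are not the contents of names.xlsx.

```python
import io

import pandas as pd

# Hypothetical stand-in for a file on disk.
csv_text = """id,first_name,last_name
101,Ada,Lovelace
102,Alan,Turing
"""

# Default: pandas creates integer row labels 0, 1, ...
default_index = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column (id) supplies the row labels.
id_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_index.index.tolist())
print(id_index.index.tolist())
```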

##### Read from URL
Load the classic Iris data directly from a URL:

In [18]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
# The file has no header row, so read_csv uses the first data row as column names
df = pd.read_csv(url)
df

,5.1,3.5,1.4,0.2,Iris-setosa
0,4.9,3.0,1.4,0.2,Iris-setosa
1,4.7,3.2,1.3,0.2,Iris-setosa
2,4.6,3.1,1.5,0.2,Iris-setosa
3,5.0,3.6,1.4,0.2,Iris-setosa
4,5.4,3.9,1.7,0.4,Iris-setosa
...,...,...,...,...,...
144,6.7,3.0,5.2,2.3,Iris-virginica
145,6.3,2.5,5.0,1.9,Iris-virginica
146,6.5,3.0,5.2,2.0,Iris-virginica
147,6.2,3.4,5.4,2.3,Iris-virginica


In [20]:
# Assign column names to the dataset; with names= the first row is kept as data
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
dataset = pd.read_csv(url, names=names)
dataset

,sepal-length,sepal-width,petal-length,petal-width,Class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica
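
The two reads above differ by one row: without names, the first data row is consumed as the header. A small offline sketch of that behaviour, using the first three Iris rows inlined as text (the variable names `no_names` and `with_names` are illustrative):

```python
import io

import pandas as pd

# First three rows of the Iris file, inlined so the example runs offline.
raw = """5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
"""
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']

# No names given: the first line is treated as the header row.
no_names = pd.read_csv(io.StringIO(raw))

# Names given: every line is treated as data.
with_names = pd.read_csv(io.StringIO(raw), names=names)

print(no_names.shape)
print(with_names.shape)
```

An equivalent option is to pass header=None and rename the columns afterwards.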


### View your data
The first thing to do when opening a new dataset is print out a few rows to keep as a visual reference. We accomplish this with .head():

In [21]:
dataset.head()

,sepal-length,sepal-width,petal-length,petal-width,Class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


.head() outputs the first five rows of your DataFrame by default, but you can also pass a number: dataset.head(10) outputs the first ten rows, for example.

In [22]:
dataset.head(10)

,sepal-length,sepal-width,petal-length,petal-width,Class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
5,5.4,3.9,1.7,0.4,Iris-setosa
6,4.6,3.4,1.4,0.3,Iris-setosa
7,5.0,3.4,1.5,0.2,Iris-setosa
8,4.4,2.9,1.4,0.2,Iris-setosa
9,4.9,3.1,1.5,0.1,Iris-setosa


To see the last five rows, use .tail(). Like .head(), .tail() accepts a number; here we print the last two rows:

In [23]:
dataset.tail(2)

,sepal-length,sepal-width,petal-length,petal-width,Class
148,6.2,3.4,5.4,2.3,Iris-virginica
149,5.9,3.0,5.1,1.8,Iris-virginica


### Getting info about your data
.info() should be one of the very first commands you run after loading your data.

.info() provides the essential details about your dataset, such as the number of rows and columns, the number of non-null values, what type of data is in each column, and how much memory your DataFrame is using.

In [24]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal-length  150 non-null    float64
 1   sepal-width   150 non-null    float64
 2   petal-length  150 non-null    float64
 3   petal-width   150 non-null    float64
 4   Class         150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
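
The Iris data has no missing values, so the non-null counts all read 150. As a rough illustration of what .info() reveals when values are missing, here is a sketch on a tiny made-up frame; `frame`, `length`, and `label` are hypothetical names, and .isna().sum() is shown as one way to get the per-column missing count directly.

```python
import pandas as pd

# Made-up frame with one missing measurement.
frame = pd.DataFrame({
    "length": [5.1, 4.9, float("nan")],
    "label": ["a", "b", "c"],
})

# Prints the summary; the non-null count for "length" is lower than the row count.
frame.info()

# Number of missing values in each column.
missing = frame.isna().sum()
print(missing)
```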


Another fast and useful attribute is .shape, which outputs just a tuple of (rows, columns):

In [25]:
dataset.shape

(150, 5)
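
Because .shape is an ordinary tuple, it can be unpacked into separate row and column counts. A minimal sketch on a small made-up frame:

```python
import pandas as pd

# Small illustrative frame: 3 rows, 2 columns.
frame = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Unpack the (rows, columns) tuple.
n_rows, n_cols = frame.shape

print(f"{n_rows} rows, {n_cols} columns")
```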

#### Columns
Here's how to print the column names of our dataset:

In [26]:
dataset.columns

Index(['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class'], dtype='object')
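
Column names are also the keys for selecting columns. The sketch below uses a two-row excerpt of the Iris columns so it runs without a download; a single name returns a Series, while a list of names returns a DataFrame.

```python
import pandas as pd

# Two-row excerpt with three of the Iris columns.
frame = pd.DataFrame({
    "sepal-length": [5.1, 4.9],
    "sepal-width": [3.5, 3.0],
    "Class": ["Iris-setosa", "Iris-setosa"],
})

# Plain Python list of the column names.
column_names = frame.columns.tolist()

# One column name gives a Series.
one_column = frame["sepal-length"]

# A list of column names gives a DataFrame.
two_columns = frame[["sepal-length", "Class"]]

print(column_names)
print(two_columns)
```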

#### Selection and Indexing Methods for Pandas DataFrames
We have two options:

.loc - selects by label (row and column names)

.iloc - selects by integer position

In [30]:
dataset.loc[1:4]

,sepal-length,sepal-width,petal-length,petal-width,Class
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
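
.loc also takes a column selection after the row selection. A brief sketch on the first five Iris rows (two columns only, inlined so it runs offline); note that the end label of a .loc slice is included.

```python
import pandas as pd

# First five Iris rows, two columns only.
frame = pd.DataFrame({
    "sepal-length": [5.1, 4.9, 4.7, 4.6, 5.0],
    "petal-length": [1.4, 1.4, 1.3, 1.5, 1.4],
})

# Rows labelled 1 through 3, end label included.
rows_1_to_3 = frame.loc[1:3]

# The same rows, restricted to one column by name.
petal_only = frame.loc[1:3, ["petal-length"]]

print(rows_1_to_3)
print(petal_only)
```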


### The iloc indexer syntax is data.iloc[row selection, column selection]

### Single selections using iloc and DataFrame
#### Rows:
1. data.iloc[0] # first row of data frame
2. data.iloc[1] # second row of data frame 
3. data.iloc[-1] # last row of data frame 

#### Columns:
1. data.iloc[:,0] # first column of data frame (first_name)
2. data.iloc[:,1] # second column of data frame (last_name)
3. data.iloc[:,-1] # last column of data frame (id)
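
The single selections listed above can be tried on a small frame. The `data` frame below is hypothetical: it only borrows the column names mentioned in the comments (`first_name`, `last_name`, `id`), and the values are made up.

```python
import pandas as pd

# Hypothetical frame with the columns named in the list above.
data = pd.DataFrame({
    "first_name": ["Ada", "Alan", "Grace"],
    "last_name": ["Lovelace", "Turing", "Hopper"],
    "id": [101, 102, 103],
})

first_row = data.iloc[0]        # first row, returned as a Series
last_row = data.iloc[-1]        # last row
first_column = data.iloc[:, 0]  # first column (first_name)
last_column = data.iloc[:, -1]  # last column (id)

print(first_row)
print(last_column)
```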

Multiple columns and rows can be selected together using the .iloc indexer.

### Multiple row and column selections using iloc and DataFrame
1. data.iloc[0:5] # first five rows of data frame
2. data.iloc[:, 0:2] # first two columns of data frame with all rows
3. data.iloc[[0,3,6,24], [0,5,6]] # 1st, 4th, 7th, 25th rows and 1st, 6th, 7th columns
4. data.iloc[0:5, 5:8] # first 5 rows and 6th, 7th, 8th columns of data frame
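
These multi-selections need a frame with at least 25 rows and 8 columns. The sketch below builds a hypothetical 30 x 8 frame of consecutive integers with columns `c0` to `c7` (names invented for illustration), so the result of each slice is easy to check.

```python
import numpy as np
import pandas as pd

# Hypothetical 30 x 8 frame holding the integers 0..239.
data = pd.DataFrame(
    np.arange(240).reshape(30, 8),
    columns=[f"c{i}" for i in range(8)],
)

first_five_rows = data.iloc[0:5]            # positions 0-4
first_two_cols = data.iloc[:, 0:2]          # columns c0 and c1, all rows
picked = data.iloc[[0, 3, 6, 24], [0, 5, 6]]  # listed rows and columns only
block = data.iloc[0:5, 5:8]                 # positions 5-7, i.e. c5, c6, c7

print(picked)
print(block)
```

Positions are zero-based and the end of a slice is excluded, which is why 5:8 yields the 6th, 7th, and 8th columns.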

In [33]:
dataset.iloc[1:5]

,sepal-length,sepal-width,petal-length,petal-width,Class
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


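dataset.loc[1:4] and dataset.iloc[1:5] return the same rows here, because the default row labels 0, 1, 2, ... coincide with the row positions.

dataset.loc[1:4] selects rows by label, and the end label 4 is included.

dataset.iloc[1:5] selects rows by integer position, and the end position 5 is excluded.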
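
The difference between label-based and position-based slicing becomes visible once the row labels no longer match the positions. A minimal sketch with a made-up frame whose labels start at 100:

```python
import pandas as pd

# Made-up frame: row labels 100..104 sit at positions 0..4.
frame = pd.DataFrame(
    {"value": [10, 20, 30, 40, 50]},
    index=[100, 101, 102, 103, 104],
)

# Label slice: labels 101, 102, 103 (end label included).
by_label = frame.loc[101:103]

# Position slice: positions 1, 2, 3 (end position excluded).
by_position = frame.iloc[1:4]

# No row labels fall between 1 and 4, so this label slice is empty.
no_match = frame.loc[1:4]

print(by_label)
print(by_position)
print(len(no_match))
```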