Pandas notebook 
Author: moamen nssar 

The pandas library provides a fast and efficient way to manage and explore data. It does so through two data structures, Series and DataFrame, which let us represent data efficiently and manipulate it in various ways.
You can install the library from within a Jupyter notebook by executing the command " ! pip install pandas " in a cell:

In [None]:
! pip install pandas

Importing the pandas library:

In [1]:
import pandas as pd 
# Note: "pd" is only an alias. You can import the library under any name, but "pd" is the common convention.

Creating a Series by passing a list of values and an index

In [45]:
s=pd.Series(data=['a','b','c','a','a','b'], index=range(1,7))
s

1    a
2    b
3    c
4    a
5    a
6    b
dtype: object
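
Because this Series uses the labels 1 to 6 instead of the default 0 to 5, label-based and position-based access differ. The sketch below rebuilds the same Series and reads values both ways; the variable names are only illustrative.

```python
import pandas as pd

# Same Series as above: labels run from 1 to 6, positions from 0 to 5.
s = pd.Series(data=['a', 'b', 'c', 'a', 'a', 'b'], index=range(1, 7))

first_by_label = s.loc[1]      # .loc looks up the label 1
first_by_position = s.iloc[0]  # .iloc looks up the position 0
last_by_label = s.loc[6]       # label 6 is the last element
labels_2_to_4 = s.loc[2:4]     # label slices include both end points

print(first_by_label, first_by_position, last_by_label)
print(labels_2_to_4)
```

Note that a label slice with .loc includes the end label, unlike ordinary Python slicing.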

#### You can also create a Series by passing only the data
#### pandas then creates a default integer index starting at 0

In [43]:
pd.Series(['a','b','c','a','a','b'])

0    a
1    b
2    c
3    a
4    a
5    b
dtype: object

#### value_counts() counts how many times each distinct value occurs


In [46]:
s.value_counts()

a    3
b    2
c    1
dtype: int64
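
If you want proportions rather than raw counts, value_counts() accepts normalize=True. A minimal sketch on the same data:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'a', 'a', 'b'], index=range(1, 7))

counts = s.value_counts()                # absolute counts per distinct value
shares = s.value_counts(normalize=True)  # the same counts divided by the total

print(counts)
print(shares)
```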

Making a range of dates using pd.date_range and using it as the index of a Series

In [52]:
dates = pd.date_range("3/6/2022 00:00", periods=6, freq="D")
ts = pd.Series(['a','b','c','a','a','b'],dates)
ts

2022-03-06    a
2022-03-07    b
2022-03-08    c
2022-03-09    a
2022-03-10    a
2022-03-11    b
Freq: D, dtype: object
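
A date index lets you select values by date. The sketch below rebuilds the same Series (the start date is written in ISO format here, which avoids any day/month ambiguity) and picks one day and a range of days.

```python
import pandas as pd

dates = pd.date_range("2022-03-06", periods=6, freq="D")
ts = pd.Series(['a', 'b', 'c', 'a', 'a', 'b'], index=dates)

# Exact lookup with a Timestamp returns a single value.
one_day = ts.loc[pd.Timestamp("2022-03-08")]

# Slicing with date strings includes both end dates.
window = ts.loc["2022-03-07":"2022-03-09"]

print(one_day)
print(window)
```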

Creating a DataFrame

In [34]:
# Fill a 6x4 frame with standard-normal random numbers from np.random.randn
import numpy as np
df = pd.DataFrame(np.random.randn(6, 4), index=range(6), columns=list("ABCD"))
# head() shows the first five rows by default; pass a number to show that many rows instead
df.head(6)

Unnamed: 0,A,B,C,D
0,1.446645,-1.269751,0.045453,1.108388
1,-0.70928,0.689367,-0.469182,-0.294499
2,0.601449,-0.609585,-1.41899,0.423379
3,1.699102,-1.12856,-1.218438,-0.726534
4,0.195017,-0.420942,1.373465,1.71689
5,-0.892353,2.282675,2.060816,-1.241788


In [8]:
# Use tail() instead of head() to show the last rows of the frame
df.tail(2) 

Unnamed: 0,A,B,C,D
4,0.037948,0.148696,-2.971566,1.612431
5,0.834276,1.296952,0.317497,2.01898


### Note: you can display the row index with df.index

In [10]:
# Display the column labels
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

describe() shows a quick statistical summary of your data:

In [11]:
df.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,0.270192,0.283451,-0.58434,0.515762
std,1.053153,1.04153,1.208683,1.301726
min,-1.036015,-1.370155,-2.971566,-1.271435
25%,-0.177851,-0.152054,-0.478564,-0.407715
50%,0.005815,0.294158,-0.218197,0.760236
75%,0.635194,1.082619,0.057783,1.407749
max,2.039621,1.437896,0.317497,2.01898


#### Sorting data
#### Sorting by index first
#### sort_index() sorts by labels: axis=0 refers to the rows (the index), while axis=1 refers to the columns.
#### In this case the rows are sorted by their index, in descending order

In [15]:
df.sort_index(axis=0, ascending=False)

Unnamed: 0,A,B,C,D
5,0.834276,1.296952,0.317497,2.01898
4,0.037948,0.148696,-2.971566,1.612431
3,2.039621,-0.252305,-0.552588,0.793702
2,-0.228362,1.437896,-0.179904,-1.271435
1,-0.026318,0.439619,0.137011,0.72677
0,-1.036015,-1.370155,-0.256491,-0.785876
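
To see how the two axis values differ, here is a small sketch on a hypothetical two-column frame (the frame and its values are made up for illustration):

```python
import pandas as pd

small = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=[0, 1, 2])

# axis=0 reorders the rows by their index labels.
rows_desc = small.sort_index(axis=0, ascending=False)

# axis=1 reorders the columns by their names; the rows stay where they are.
cols_desc = small.sort_index(axis=1, ascending=False)

print(rows_desc)
print(cols_desc)
```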


Sorting the data by the values in a specific column

In [16]:
df.sort_values(by="C")

Unnamed: 0,A,B,C,D
4,0.037948,0.148696,-2.971566,1.612431
3,2.039621,-0.252305,-0.552588,0.793702
0,-1.036015,-1.370155,-0.256491,-0.785876
2,-0.228362,1.437896,-0.179904,-1.271435
1,-0.026318,0.439619,0.137011,0.72677
5,0.834276,1.296952,0.317497,2.01898
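
sort_values() also takes ascending=False and a list of columns. A short sketch on a made-up frame (the column names "group" and "value" are only for illustration):

```python
import pandas as pd

scores = pd.DataFrame({"group": ["x", "y", "x", "y"], "value": [3, 1, 2, 4]})

# Largest value first.
by_value_desc = scores.sort_values(by="value", ascending=False)

# Sort by group, and within each group by value.
by_group_then_value = scores.sort_values(by=["group", "value"])

print(by_value_desc)
print(by_group_then_value)
```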


Selecting data

In [17]:
#Selecting a single column
df['B']

0   -1.370155
1    0.439619
2    1.437896
3   -0.252305
4    0.148696
5    1.296952
Name: B, dtype: float64

In [19]:
# Selecting rows by position with a slice (here rows 0 and 1; the end position is excluded)
df[0:2]

Unnamed: 0,A,B,C,D
0,-1.036015,-1.370155,-0.256491,-0.785876
1,-0.026318,0.439619,0.137011,0.72677
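
For more control over selection, pandas offers .loc (by label), .iloc (by position) and boolean masks. The sketch below uses a small hypothetical frame with string row labels so the difference is visible.

```python
import pandas as pd

frame = pd.DataFrame(
    {"A": [1.0, -2.0, 3.0], "B": [10, 20, 30]},
    index=["r1", "r2", "r3"],
)

by_label = frame.loc["r1":"r2", ["A"]]  # rows r1 to r2 (inclusive), column A
by_position = frame.iloc[0:2, 0:1]      # first two rows, first column
positive_a = frame[frame["A"] > 0]      # only rows where A is positive

print(by_label)
print(positive_a)
```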


Adding a new column that contains missing values (NaN) to our DataFrame

In [35]:
df["E"] = [1,5.6,np.nan,1.2,np.nan,0.9]
df.head()

Unnamed: 0,A,B,C,D,E
0,1.446645,-1.269751,0.045453,1.108388,1.0
1,-0.70928,0.689367,-0.469182,-0.294499,5.6
2,0.601449,-0.609585,-1.41899,0.423379,
3,1.699102,-1.12856,-1.218438,-0.726534,1.2
4,0.195017,-0.420942,1.373465,1.71689,


Dealing with missing values

In [26]:
# Get the number of missing values in each column
df.isna().sum()

A    0
B    0
C    0
D    0
E    2
dtype: int64

In [27]:
# Drop any rows that have missing data.
# The command below does not change df; it only returns a copy without the rows that contain missing values.
# Use 'df = df.dropna()' to keep the change.
df.dropna()

Unnamed: 0,A,B,C,D,E
0,-1.036015,-1.370155,-0.256491,-0.785876,1.0
1,-0.026318,0.439619,0.137011,0.72677,5.6
3,2.039621,-0.252305,-0.552588,0.793702,1.2
5,0.834276,1.296952,0.317497,2.01898,0.9
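
dropna() can be made less strict with its how and subset parameters. A minimal sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"A": [1.0, np.nan, 3.0], "E": [np.nan, np.nan, 0.9]})

any_missing = frame.dropna()              # drop rows with at least one NaN
all_missing = frame.dropna(how="all")     # drop rows only if every value is NaN
subset_a = frame.dropna(subset=["A"])     # only look at column A

print(any_missing)
print(all_missing)
print(subset_a)
```

In all three cases the original frame is left untouched.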


In [36]:
# Filling missing values with a specific value
# You can also fill with the column's mean, median or mode, or any other value
df.fillna(value=0)

Unnamed: 0,A,B,C,D,E
0,1.446645,-1.269751,0.045453,1.108388,1.0
1,-0.70928,0.689367,-0.469182,-0.294499,5.6
2,0.601449,-0.609585,-1.41899,0.423379,0.0
3,1.699102,-1.12856,-1.218438,-0.726534,1.2
4,0.195017,-0.420942,1.373465,1.71689,0.0
5,-0.892353,2.282675,2.060816,-1.241788,0.9
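
As a sketch of filling with a statistic instead of a constant, the example below takes the values of column E from above as a standalone Series and fills the gaps with its mean and with its median:

```python
import numpy as np
import pandas as pd

e = pd.Series([1, 5.6, np.nan, 1.2, np.nan, 0.9], name="E")

# mean() and median() skip missing values by default.
filled_mean = e.fillna(e.mean())
filled_median = e.fillna(e.median())

print(filled_mean)
print(filled_median)
```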


Applying functions to the data:


In [37]:
df.apply(np.cumsum)

Unnamed: 0,A,B,C,D,E
0,1.446645,-1.269751,0.045453,1.108388,1.0
1,0.737366,-0.580384,-0.423729,0.81389,6.6
2,1.338814,-1.189969,-1.842719,1.237269,
3,3.037916,-2.318529,-3.061158,0.510734,7.8
4,3.232933,-2.739471,-1.687692,2.227624,
5,2.34058,-0.456796,0.373123,0.985836,8.7


In [38]:
# Using a lambda to compute the range (max - min) of each column
df.apply(lambda x: x.max() - x.min())


A    2.591455
B    3.552426
C    3.479806
D    2.958678
E    4.700000
dtype: float64
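
By default apply() passes each column to the function. With axis=1 it passes each row instead. A small sketch on a made-up frame:

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, 4], "B": [3, 10]})

# Default (axis=0): the lambda receives one column at a time.
col_range = frame.apply(lambda col: col.max() - col.min())

# axis=1: the lambda receives one row at a time.
row_range = frame.apply(lambda row: row.max() - row.min(), axis=1)

print(col_range)
print(row_range)
```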

Concat
pandas provides various facilities for combining Series and DataFrame objects, with different kinds of set logic for the indexes and relational-algebra functionality for join / merge-type operations.
Concatenating pandas objects with concat():

In [55]:
# Let's break the data into two pieces and then concatenate them again
pd.concat([df[:3], df[3:]])

Unnamed: 0,A,B,C,D,E
0,1.446645,-1.269751,0.045453,1.108388,1.0
1,-0.70928,0.689367,-0.469182,-0.294499,5.6
2,0.601449,-0.609585,-1.41899,0.423379,
3,1.699102,-1.12856,-1.218438,-0.726534,1.2
4,0.195017,-0.420942,1.373465,1.71689,
5,-0.892353,2.282675,2.060816,-1.241788,0.9
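
Two options of concat() that often come up are ignore_index, which renumbers the rows, and axis=1, which places frames side by side. A sketch with two tiny made-up frames:

```python
import pandas as pd

top = pd.DataFrame({"A": [1, 2]})
bottom = pd.DataFrame({"A": [3, 4]})

stacked = pd.concat([top, bottom])                        # keeps the original labels
renumbered = pd.concat([top, bottom], ignore_index=True)  # fresh index 0..3
side_by_side = pd.concat([top, bottom.rename(columns={"A": "B"})], axis=1)

print(stacked)
print(renumbered)
print(side_by_side)
```

Without ignore_index the row labels 0 and 1 appear twice in the stacked result, which is usually not what you want when the pieces were numbered independently.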


Grouping
Grouping is a three-step process: first the data is split into groups based on some criteria, then a function is applied to each group independently, and finally the results are combined into one data structure.

Let's make another DataFrame to apply grouping to.

In [64]:
# Another way to create a DataFrame: pass a dictionary that maps column names to values
df2 = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)
df2.head()

Unnamed: 0,A,B,C,D
0,foo,one,-1.492154,1.865415
1,bar,one,0.024809,-0.367547
2,foo,two,-0.134267,-0.928064
3,bar,three,1.132137,0.87824
4,foo,two,-0.308679,-1.453571


In [66]:
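# Group by one column
# You can use another aggregation such as mean() or count() instead of sum()
# C and D are selected explicitly because newer pandas versions would otherwise also try to sum the strings in B
df2.groupby("A")[["C", "D"]].sum()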

A,C,D
bar,0.560295,0.313029
foo,2.393302,-0.819864


In [67]:
# Group by two columns
df2.groupby(["A","B"]).sum()

A,B,C,D
bar,one,0.024809,-0.367547
bar,three,1.132137,0.87824
bar,two,-0.596651,-0.197664
foo,one,1.671007,2.211417
foo,three,1.165242,-0.649647
foo,two,-0.442946,-2.381634
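
To compute several aggregations per group at once, agg() accepts a list of function names. A minimal sketch on a small made-up frame (the name "sales" and its values are only illustrative):

```python
import pandas as pd

sales = pd.DataFrame(
    {"A": ["foo", "bar", "foo", "bar"], "C": [1.0, 2.0, 3.0, 4.0]}
)

# One row per group, one column per aggregation.
summary = sales.groupby("A")["C"].agg(["sum", "mean", "count"])

print(summary)
```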
