
# **Pandas**


Pandas is a Python package for loading, manipulating, and analyzing tabular data.

Think of a pandas dataframe as an Excel spreadsheet that stores some data. One column might hold the customer name, another the name of the product sold, and another the price or quantity. Each row could then be an individual sale.

A dataframe is made up of several series. Each column of a dataframe is a series.

We can name each column and row of a dataframe.

A pandas dataframe is very similar to a data.frame in R.

Like a numpy array, a dataframe is a more robust way to store data than a list of lists. Dataframes are also more flexible than numpy arrays.

A numpy array stores a matrix whose entries all share the same data type. In a dataframe, each column can have its own data type.
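To make the difference concrete, here is a small sketch. The column names and values are made up for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# A numpy array holds a single dtype: mixing numbers and text
# forces every entry to be stored as a string.
mixed_array = np.array([1, 2.5, "three"])
print(mixed_array.dtype)  # a fixed-width unicode string dtype

# A dataframe keeps a separate dtype for each column.
sales = pd.DataFrame({
    "customer": ["Ann", "Bob"],   # text
    "quantity": [3, 1],           # integers
    "price": [9.99, 24.50],       # floats
})
print(sales.dtypes)
```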

It is often easiest to convert some subset of a dataframe to a numpy array and then use that to do some math.
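A minimal sketch of that workflow follows. The dataframe and its column names are hypothetical, chosen only to show the conversion step.

```python
import numpy as np
import pandas as pd

measurements = pd.DataFrame({
    "label": ["a", "b", "c"],
    "x": [1.0, 2.0, 3.0],
    "y": [4.0, 5.0, 6.0],
})

# Select only the numeric columns, then convert them to a numpy array.
xy = measurements[["x", "y"]].to_numpy()

# Ordinary numpy math now applies, for example a matrix product.
gram = xy.T @ xy
print(gram)
```

Selecting the numeric columns first matters: converting the whole dataframe, including the `label` column, would give an array of generic Python objects that is awkward to do math on.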

Pandas also has SQL-like functions for merging, joining, and sorting dataframes.
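As a rough illustration of the SQL-like operations, the sketch below joins two small, made-up tables with `pd.merge` and orders the result with `sort_values`. The table contents are assumptions for the example only.

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3], "amount": [20.0, 5.0, 12.5]})

# Inner join on the shared key, similar to SQL's JOIN ... ON customer_id.
merged = pd.merge(customers, orders, on="customer_id", how="inner")

# Sort by amount, largest first, similar to SQL's ORDER BY amount DESC.
ranked = merged.sort_values("amount", ascending=False)
print(ranked)
```

Because the join is an inner join, the customer with no orders does not appear in the result.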

In [None]:
import pandas as pd
import numpy as np

In [None]:
mylist = [5.4,6.1,1.7,99.8]
myarray = np.array(mylist)

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.

In [None]:
myseries1 = pd.Series(data=mylist)
myseries2 = pd.Series(data = myarray)

In [None]:
print(myseries1)
print(myseries2)

0     5.4
1     6.1
2     1.7
3    99.8
dtype: float64
0     5.4
1     6.1
2     1.7
3    99.8
dtype: float64


In [None]:
print(myseries1[2])

1.7


In [None]:
# we can add labels to the entries of a series

my_labels = ['First', 'Second', 'Third','Four']
myseries3 = pd.Series(data=mylist,index=my_labels)
print(myseries3)

First      5.4
Second     6.1
Third      1.7
Four      99.8
dtype: float64


In [None]:
# the data= keyword is optional: the first positional argument is taken as the data

myseries3 = pd.Series(mylist,index=my_labels)
myseries3

First      5.4
Second     6.1
Third      1.7
Four      99.8
dtype: float64

In [None]:
# we can also access entries using the index labels
print(myseries3['Second'])

6.1


In [None]:
# creating another series
myseries5 = pd.Series([5.5,1.1,8.8,1.6],['first', 'second', 'Third','four'])

In [None]:
myseries5

first     5.5
second    1.1
Third     8.8
four      1.6
dtype: float64

In [None]:
# addition aligns on index labels; only 'Third' appears in both series
print(myseries5+myseries3)

First      NaN
Four       NaN
Second     NaN
Third     10.5
first      NaN
four       NaN
second     NaN
dtype: float64
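The NaN values above appear because pandas aligns the two series on their index labels before adding them, and labels are case-sensitive, so `'First'` and `'first'` do not match. If the unmatched entries should be kept rather than turned into NaN, one option is `Series.add` with `fill_value`. This sketch rebuilds the two series under the shorter names `s3` and `s5`.

```python
import pandas as pd

s3 = pd.Series([5.4, 6.1, 1.7, 99.8], index=["First", "Second", "Third", "Four"])
s5 = pd.Series([5.5, 1.1, 8.8, 1.6], index=["first", "second", "Third", "four"])

# Plain addition: a label present in only one series yields NaN.
plain_sum = s3 + s5

# With fill_value=0, a label missing from one series is treated as 0.
filled_sum = s3.add(s5, fill_value=0)
print(filled_sum)
```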


In [None]:
print(myseries3)
print(myseries5)

First      5.4
Second     6.1
Third      1.7
Four      99.8
dtype: float64
first     5.5
second    1.1
Third     8.8
four      1.6
dtype: float64


In [None]:
# we can combine series into a dataframe using the concat function

# df1 = pd.concat([myseries3,myseries5],axis=0,sort=False) # stacks them on top of each other
df1 = pd.concat([myseries3,myseries5],axis=1,sort=False) # places them side by side
df1

Unnamed: 0,0,1
First,5.4,
Second,6.1,
Third,1.7,8.8
Four,99.8,
first,,5.5
second,,1.1
four,,1.6
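To see how the `axis` argument changes the result, the sketch below compares both options on the same two series. The `keys` argument, which names the resulting columns, is an addition that the cell above does not use.

```python
import pandas as pd

s3 = pd.Series([5.4, 6.1, 1.7, 99.8], index=["First", "Second", "Third", "Four"])
s5 = pd.Series([5.5, 1.1, 8.8, 1.6], index=["first", "second", "Third", "four"])

# axis=0 appends one series after the other: the result is a longer Series.
stacked = pd.concat([s3, s5], axis=0)

# axis=1 places them side by side, aligned on the index: the result is a
# DataFrame, and keys= gives the columns names instead of 0 and 1.
side_by_side = pd.concat([s3, s5], axis=1, keys=["s3", "s5"])

print(stacked.shape, side_by_side.shape)
```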


In [None]:
# create a 5x5 dataframe of standard normal random values

df2 = pd.DataFrame(np.random.randn(5,5))
df2

Unnamed: 0,0,1,2,3,4
0,0.477929,0.512973,0.298987,0.571736,-0.675657
1,0.52831,1.006699,1.076277,0.482885,0.944089
2,1.581434,-0.237647,0.631048,0.055389,-0.321568
3,-0.480784,-0.606685,-0.589807,0.626189,-0.614027
4,-0.968288,-0.455186,0.223686,-0.631441,0.245505


In [None]:
# let's give labels to the rows and columns
df3 = pd.DataFrame(np.random.randn(5,5),index=['first row','second row','third row','fourth row','fifth row'],
                   columns=['first col','second col','third col','fourth col','fifth col'])
df3

Unnamed: 0,first col,second col,third col,fourth col,fifth col
first row,0.393836,-0.615208,0.260332,-0.03365,-0.594216
second row,-1.057177,0.442267,-0.707232,-0.310117,0.094763
third row,-0.683347,-1.472729,1.365848,0.796311,-1.504446
fourth row,-2.365181,0.912719,-0.399604,0.350921,-0.610919
fifth row,-0.152003,-0.850505,0.969284,-0.468978,0.258238


Accessing and modifying a dataframe

In [None]:
# select a single column by name (returns a Series)
df3['second col']

first row    -0.615208
second row    0.442267
third row    -1.472729
fourth row    0.912719
fifth row    -0.850505
Name: second col, dtype: float64

In [None]:
# select several columns with a list of names (returns a dataframe)
df3[['second col','first col']]

Unnamed: 0,second col,first col
first row,-0.615208,0.393836
second row,0.442267,-1.057177
third row,-1.472729,-0.683347
fourth row,0.912719,-2.365181
fifth row,-0.850505,-0.152003


In [None]:
# we can access a row of the dataframe by its label with .loc

df3.loc['fourth row']

first col    -2.365181
second col    0.912719
third col    -0.399604
fourth col    0.350921
fifth col    -0.610919
Name: fourth row, dtype: float64

In [None]:
# access a row by integer position with .iloc (position 2 is the third row)
df3.iloc[2]

first col    -0.683347
second col   -1.472729
third col     1.365848
fourth col    0.796311
fifth col    -1.504446
Name: third row, dtype: float64
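The difference between `.loc` (label-based) and `.iloc` (position-based) is easiest to see side by side. This is a small sketch on a made-up 3x3 frame, not the random `df3` above.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(9).reshape(3, 3),
    index=["r1", "r2", "r3"],
    columns=["a", "b", "c"],
)

by_label = frame.loc["r2"]           # row whose index label is "r2"
by_position = frame.iloc[1]          # second row, counting from 0
single_value = frame.loc["r3", "b"]  # one cell, by row and column label

# Label slices include the end label; positional slices exclude the end.
label_slice = frame.loc["r1":"r2"]
position_slice = frame.iloc[0:2]
```

Both slices return the first two rows, even though one is written with an inclusive end label and the other with an exclusive end position.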

In [None]:
# select certain rows and certain columns

df = df3.loc[['first row','second row'],["first col","second col"]]
df

Unnamed: 0,first col,second col
first row,0.393836,-0.615208
second row,-1.057177,0.442267


In [None]:
# an elementwise comparison returns a dataframe of booleans
df3>0

Unnamed: 0,first col,second col,third col,fourth col,fifth col
first row,True,False,True,False,False
second row,False,True,False,False,True
third row,False,False,True,True,False
fourth row,False,True,False,True,False
fifth row,False,False,True,False,True


In [None]:
# a boolean dataframe used as a mask keeps the matching entries and sets the rest to NaN
print(df3[df3>0])

            first col  second col  third col  fourth col  fifth col
first row    0.393836         NaN   0.260332         NaN        NaN
second row        NaN    0.442267        NaN         NaN   0.094763
third row         NaN         NaN   1.365848    0.796311        NaN
fourth row        NaN    0.912719        NaN    0.350921        NaN
fifth row         NaN         NaN   0.969284         NaN   0.258238
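Masking with a whole boolean dataframe keeps the shape and fills in NaN. A common variant is to filter entire rows with a boolean Series built from one or more columns. The sketch below uses a small made-up frame so that the result is predictable.

```python
import pandas as pd

scores = pd.DataFrame(
    {"first col": [0.4, -1.1, -0.7], "second col": [-0.6, 0.4, -1.5]},
    index=["first row", "second row", "third row"],
)

# Keep only the rows where 'first col' is positive.
positive_rows = scores[scores["first col"] > 0]

# Combine conditions with & (and) or | (or); each condition needs parentheses.
either_positive = scores[(scores["first col"] > 0) | (scores["second col"] > 0)]
print(either_positive)
```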


Saving and loading

In [None]:
# save to a csv file

df3.to_csv('df3.csv', index = True)

In [None]:
# save in Excel format; index=False leaves the row labels out of the file
df3.to_excel('data.xlsx', index =False, sheet_name='first sheet')



In [None]:
# load the file; index_col=1 uses the sheet's second column ('second col') as the index
new_df = pd.read_excel('data.xlsx', sheet_name = 'first sheet', index_col =1)

In [None]:
new_df

second col,first col,third col,fourth col,fifth col
-0.615208,0.393836,0.260332,-0.03365,-0.594216
0.442267,-1.057177,-0.707232,-0.310117,0.094763
-1.472729,-0.683347,1.365848,0.796311,-1.504446
0.912719,-2.365181,-0.399604,0.350921,-0.610919
-0.850505,-0.152003,0.969284,-0.468978,0.258238
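In the loaded dataframe above, the original row labels are gone because the file was written with `index=False`, and `'second col'` has become the index. If the goal is to get the same dataframe back, one approach is to write the index and read it back with `index_col=0`. The sketch below shows this with CSV and an in-memory buffer (an assumption made so that the example creates no file). The same `index` and `index_col` arguments exist on `to_excel` and `read_excel`.

```python
import io
import pandas as pd

original = pd.DataFrame(
    {"first col": [0.5, -1.0], "second col": [2.0, 3.5]},
    index=["first row", "second row"],
)

buffer = io.StringIO()
original.to_csv(buffer, index=True)          # row labels become the first column
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)  # read that column back as the index

print(restored)
```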


# **Pandas Profiling**

In [None]:
cd /content/drive/MyDrive/studies/olympus/data_visualization

/content/drive/MyDrive/studies/olympus/data_visualization


In [None]:
ls

 Automobile.csv
'M1W2 - Visualization-Questions.ipynb'
 output.html
'Practice Exercise Solution Notebook - Visualizations.ipynb'
 tips.csv
 tips.gsheet
 Untitled0.ipynb
 Visualization.ipynb


In [None]:
data = pd.read_csv('tips.csv')

In [None]:
data

Unnamed: 0.1,Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,0,16.99,1.01,Female,No,Sun,Dinner,2
1,1,10.34,1.66,Male,No,Sun,Dinner,3
2,2,21.01,3.50,Male,No,Sun,Dinner,3
3,3,23.68,3.31,Male,No,Sun,Dinner,2
4,4,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...,...
239,239,29.03,5.92,Male,No,Sat,Dinner,3
240,240,27.18,2.00,Female,Yes,Sat,Dinner,2
241,241,22.67,2.00,Male,Yes,Sat,Dinner,2
242,242,17.82,1.75,Male,No,Sat,Dinner,2


In [None]:
data.shape

(244, 8)

In [None]:
data.head(10)

Unnamed: 0.1,Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,0,16.99,1.01,Female,No,Sun,Dinner,2
1,1,10.34,1.66,Male,No,Sun,Dinner,3
2,2,21.01,3.5,Male,No,Sun,Dinner,3
3,3,23.68,3.31,Male,No,Sun,Dinner,2
4,4,24.59,3.61,Female,No,Sun,Dinner,4
5,5,25.29,4.71,Male,No,Sun,Dinner,4
6,6,8.77,2.0,Male,No,Sun,Dinner,2
7,7,26.88,3.12,Male,No,Sun,Dinner,4
8,8,15.04,1.96,Male,No,Sun,Dinner,2
9,9,14.78,3.23,Male,No,Sun,Dinner,2


In [None]:
data.describe()

Unnamed: 0.1,Unnamed: 0,total_bill,tip,size
count,244.0,244.0,244.0,244.0
mean,121.5,19.785943,2.998279,2.569672
std,70.580923,8.902412,1.383638,0.9511
min,0.0,3.07,1.0,1.0
25%,60.75,13.3475,2.0,2.0
50%,121.5,17.795,2.9,2.0
75%,182.25,24.1275,3.5625,3.0
max,243.0,50.81,10.0,6.0
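The `Unnamed: 0` column summarized above simply counts from 0 to 243, which suggests that the CSV was saved with its index included. This is an inference from the output, not something the notebook states. Under that assumption, passing `index_col=0` to `read_csv` keeps the counter out of the data columns. The sketch uses a hypothetical two-row stand-in for the file.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV that was saved together with its index.
csv_text = ",total_bill,tip\n0,16.99,1.01\n1,10.34,1.66\n"

# Default load: the unnamed first column becomes a data column.
as_loaded = pd.read_csv(io.StringIO(csv_text))
print(as_loaded.columns.tolist())

# With index_col=0 the first column is used as the index instead.
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(with_index.describe())
```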


In [None]:
!pip install pandas-profiling==2.8.0

In [None]:
from pandas_profiling import ProfileReport



In [None]:
!pip freeze |grep pandas-profiling

pandas-profiling==2.8.0


In [None]:
profile = ProfileReport(data, title='Pandas Profiling Report', explorative=True)

In [None]:
profile




To save the report as a file on the local system:

In [None]:
profile.to_file("output.html")
