PANDAS
------------

Pandas provides high-performance data manipulation and analysis tools built on its two core data structures, Series and DataFrame.


When working with tabular data, such as data stored in spreadsheets or databases, pandas is the right tool for you. 
pandas will help you to explore, clean, and process your data.


It supports many file formats and data sources out of the box (CSV, Excel, SQL, JSON, XML, ...).
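
As a minimal sketch of this, the snippet below round-trips a small, made-up table through CSV and JSON using in-memory buffers, so no files are needed. The column names and values are illustrative assumptions, not data from this notebook.

```python
import io
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({"city": ["Pune", "Delhi"], "temp_c": [31.5, 38.0]})

# Write to CSV text, then read it back from an in-memory buffer
csv_text = df.to_csv(index=False)
from_csv = pd.read_csv(io.StringIO(csv_text))

# Same round trip through JSON, one record per row
json_text = df.to_json(orient="records")
from_json = pd.read_json(io.StringIO(json_text), orient="records")

print(from_csv)
print(from_json)
```

The same `to_*` / `read_*` naming pattern applies to the other supported formats, although some of them (Excel, SQL) need extra packages installed.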


Need to select specific rows and/or columns, or filter the data on a condition? Methods for slicing, selecting, and extracting the data you need are available in pandas.
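
A short sketch of conditional filtering, assuming a tiny table of three people (the same shape as the first DataFrame built in this notebook). A boolean mask selects rows, and `loc` combines a row condition with a column label.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Sachin", "Rahul", "Mithali"],
    "Age": [22, 35, 58],
    "Sex": ["male", "male", "female"],
})

# Boolean mask: keep only the rows where Age is greater than 30
over_30 = people[people["Age"] > 30]

# loc with a row condition and a column label: names of the male entries
male_names = people.loc[people["Sex"] == "male", "Name"]

print(over_30)
print(male_names)
```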


pandas can plot your data out of the box, using the power of Matplotlib. 
You can pick the plot type (scatter, bar, boxplot, ...) that suits your data.


Basic statistics (mean, median, min, max, counts, ...) are easy to calculate. 

These or custom aggregations can be applied to the entire data set, to a sliding window of the data, or to groups defined by categories.
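
The three kinds of aggregation mentioned above can be sketched as follows. The `sales` table and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical data: units sold per region
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south", "north"],
    "units": [10, 20, 30, 40, 50],
})

# 1. Whole data set: a single mean over the column
overall_mean = sales["units"].mean()

# 2. Sliding window: mean of each pair of consecutive rows
#    (the first entry is NaN because its window is incomplete)
rolling_mean = sales["units"].rolling(window=2).mean()

# 3. Grouped by category: total units per region
by_region = sales.groupby("region")["units"].sum()

print(overall_mean)
print(rolling_mean)
print(by_region)
```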


Multiple tables can be concatenated both column-wise and row-wise, and database-like join/merge operations are provided to combine multiple tables of data.
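
A minimal sketch of both ideas, using three small made-up tables: `pd.concat` stacks tables along rows or columns, while `merge` joins them on a key column, much like a SQL join.

```python
import pandas as pd

# Hypothetical tables for illustration
left = pd.DataFrame({"id": [1, 2], "name": ["Tom", "Jack"]})
more = pd.DataFrame({"id": [3], "name": ["Steve"]})
ages = pd.DataFrame({"id": [1, 2, 3], "age": [28, 34, 29]})

# Row-wise concatenation: append the rows of `more` below `left`
stacked = pd.concat([left, more], ignore_index=True)

# Column-wise concatenation: place columns side by side, aligned on the index
side_by_side = pd.concat([stacked, ages[["age"]]], axis=1)

# Database-style join on the shared "id" column
merged = stacked.merge(ages, on="id", how="left")

print(side_by_side)
print(merged)
```

Column-wise `concat` aligns on the index, so it only gives the intended result when both tables share the same row labels. `merge` matches on key values and is usually the safer choice.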


Pandas has great support for time series and has an extensive set of tools for working with dates, times, and time-indexed data.
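
As a small sketch of the time series tools, the example below builds a made-up daily series, selects part of it by date, and resamples it into two-day buckets. The dates and values are assumptions chosen for illustration.

```python
import pandas as pd

# Six consecutive days starting on 1 January 2024
idx = pd.date_range("2024-01-01", periods=6, freq="D")
readings = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

# Select by date string: everything from 3 January onwards
from_jan_3 = readings.loc["2024-01-03":]

# Resample: sum the readings in two-day buckets
two_day_sum = readings.resample("2D").sum()

# Date components are available directly on the index
weekdays = idx.day_name()

print(from_jan_3)
print(two_day_sum)
print(weekdays)
```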


Data sets do not only contain numerical data. pandas provides a wide range of functions to clean textual data and extract useful information from it.
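
Text cleaning goes through the `.str` accessor of a Series. The sketch below uses a few sample strings in the "Surname, Title. Given name" style; the strings and the regular expression are illustrative assumptions.

```python
import pandas as pd

names = pd.Series([
    "  Braund, Mr. Owen ",
    "Heikkinen, Miss. Laina",
    "Allen, Mr. William",
])

# Remove leading and trailing whitespace
cleaned = names.str.strip()

# Split on the comma and keep the first part (the surname)
surnames = cleaned.str.split(",").str[0]

# Extract the title: the word between the comma and the following period
titles = cleaned.str.extract(r",\s*([A-Za-z]+)\.", expand=False)

print(surnames)
print(titles)
```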


Pandas can be installed via pip:

In [None]:
#! pip install pandas

In [1]:

import numpy as np
import pandas as pd


In [None]:
d = {
    "Name": ["Sachin", "Rahul", "Mithali"],
    "Age": [22, 35, 58],
    "Sex": ["male", "male", "female"],
}

df = pd.DataFrame(d)
print(df)

In [None]:
print(type(df))

In [None]:
print(df["Age"])

print(type(df["Age"]))

In [None]:
print(df[["Name", "Age"]])

type(df[["Name", "Age"]])

In [None]:
ages = pd.Series([22, 35, 58], name="Age")
print(ages)

print(type(ages))

### Writing and Reading data

In [None]:
import numpy as np
import pandas as pd

In [None]:
# Random sample data for 100 employees (values differ on every run)
data_employee = {
    'employee_id': np.arange(1, 101),
    'Age': np.random.randint(25, 60, size=100),
    'Basic Pay': np.random.randint(15600, 67100, size=100),
    'No of Clients': np.random.randint(1, 1000, size=100),
    'Years of Service': np.random.randint(0, 41, size=100),
    'Performance Score': np.random.randint(0, 2, size=100),
}

df = pd.DataFrame(data_employee, columns=['employee_id', 'Age', 'Basic Pay', 'No of Clients', 'Years of Service', 'Performance Score'])

print(df)

In [None]:
df.to_csv('emp.csv',sep=',',index=False)

In [None]:
df=pd.read_csv('emp.csv')

In [None]:
df

### Case Study 2

In [None]:
import pandas as pd

In [None]:
titanic = pd.read_csv("titanic.csv")

In [None]:
titanic

In [None]:
titanic.dtypes

In [None]:
titanic.count()

In [None]:
titanic.info()

In [None]:
titanic.describe()

In [None]:
titanic.describe(include="object")

In [None]:
titanic.describe(include="all")

In [None]:
ages = titanic["Age"]

In [None]:
ages.head()

In [None]:
ages.tail()

In [None]:
type(titanic["Age"])

In [None]:
titanic["Age"].shape

In [None]:
age_sex = titanic[["Age", "Sex"]]
age_sex.head(10)

In [None]:
type(titanic[["Age", "Sex"]])

In [None]:
titanic[["Age", "Sex"]].shape

In [None]:
titanic["Pclass"].value_counts(normalize=True)

In [None]:
titanic["Survived"].value_counts()

In [None]:
titanic["Survived"].unique()

In [None]:
titanic["Pclass"].unique()

In [None]:
# sort the Titanic data according to the age of the passengers

titanic.sort_values(by="Age")

In [None]:
# sort the Titanic data according to the cabin class and age in descending order

titanic.sort_values(by=['Pclass', 'Age'], ascending=False)


### Case Study 3

In [None]:
import pandas as pd

df = pd.read_csv('whr.csv')
df

In [None]:
df['Country']

In [None]:
df[['Country','Happiness.Rank']]

In [None]:
df[['Country','Happiness.Rank','Health..Life.Expectancy.']]

In [None]:
df[:5]

In [None]:
df.iloc[:,:2]

In [None]:
df.iloc[:5,:2]

In [None]:
df[df['Country']=='India']

In [None]:
sorted_data = df.sort_values(by='Freedom')
sorted_data[:5]

### Apply function

In [None]:
df = pd.read_csv('whr.csv')
df

In [None]:
df.apply(lambda x : x)

In [None]:
df.apply(lambda x : x.iloc[0], axis=0)  # first value of each column

In [None]:
df.apply(lambda x : x.iloc[0], axis=1)  # first value of each row

In [None]:
df.apply(lambda x : x['Happiness.Score'], axis=1)

In [None]:
def clip_score(score):
    if score > 7 :
        score=7
    return score

hs = df['Happiness.Score'].apply(clip_score)


In [None]:
hs
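
For a simple cap like this, a vectorized method can replace the element-wise `apply`. The sketch below compares the two on a few made-up scores standing in for the `Happiness.Score` column, since `whr.csv` is not assumed to be available here.

```python
import pandas as pd

# Hypothetical scores, used in place of df['Happiness.Score']
scores = pd.Series([7.54, 6.98, 7.0, 5.25])

def clip_score(score):
    # Cap any score above 7 at 7
    if score > 7:
        score = 7
    return score

# Element-wise: calls the Python function once per value
applied = scores.apply(clip_score)

# Vectorized: Series.clip caps all values in one operation
clipped = scores.clip(upper=7)

print(applied)
print(clipped)
```

Both give the same values. `clip` avoids a Python-level function call per element, which tends to matter on large columns.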

###  Descriptive Stats using CSV data

In [None]:
x = pd.read_csv('whr.csv')
x

In [None]:
print (x.index)

In [None]:
print (x.columns)

In [None]:
print (x.values)

In [None]:
print (x.shape)

In [None]:
print (x.count())

In [None]:
print (x.describe(include=object))

In [None]:
print (x.head())

In [None]:
print (x.tail())

### 4. Introduction to DataFrame

In [None]:
import pandas as pd

In [None]:
data = [['Alex',10],['Bob',12],['Clarke',13]]
print(type(data))

df = pd.DataFrame(data,columns=['Name','Age'])
print (df)

In [None]:
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
print(type(data))

df = pd.DataFrame(data)
print (df)

#Create a DataFrame from Dict of ndarrays / Lists

In [None]:
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
print (df)

In [None]:
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]

df = pd.DataFrame(data)
print (df)

In [None]:
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, index=['first', 'second'])
print (df)

In [None]:
import pandas as pd

d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
      'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
df

In [None]:
print(df['one'])

In [None]:
print (df[['one','two']])

In [None]:
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
      'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
df
# Adding a new column to an existing DataFrame object with column label by passing new series

#### Adding a new column by passing as Series

In [None]:
df['three']=pd.Series([10,20,30],index=['a','b','c'])
print (df)

Adding a new column using the existing columns in DataFrame:

In [None]:
df['four']=df['one'] + df['three']
print (df)
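
Column assignment and arithmetic both align on the row index, so labels missing from one side end up as NaN. A minimal sketch, assuming a simplified version of the table above:

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3, 4]}, index=["a", "b", "c", "d"])

# The new Series has no entry for label 'd', so that row gets NaN
df["three"] = pd.Series([10, 20, 30], index=["a", "b", "c"])

# Arithmetic is row-by-row on matching labels; NaN propagates to 'four'
df["four"] = df["one"] + df["three"]

print(df)
```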

In [None]:
# Rebuild the DataFrame for the column deletion examples
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']), 
     'three' : pd.Series([10,20,30], index=['a','b','c'])}
df = pd.DataFrame(d)
print (df)

#### Deleting the first column using the del statement

In [None]:
del df['one']
print (df)

#### Deleting a column using the pop method

In [None]:
df.pop('two')
print (df)

In [None]:
print (df.loc['c'])

# Selection by label: rows can be selected by passing a row label
# to loc.

In [None]:
print (df.iloc[2])

# Selection by position: rows can be selected by passing an integer location to iloc.

#### Multiple rows can be selected using the ':' slice operator.

In [None]:
print (df[2:4])
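
The difference between the three selection styles can be sketched on a small stand-in table with string row labels (the values are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame({"three": [10.0, 20.0, 30.0, None]}, index=["a", "b", "c", "d"])

by_label = df.loc["c"]          # row whose label is 'c'
by_position = df.iloc[2]        # third row, whatever its label is
row_slice = df[2:4]             # positional slice: end position is excluded
label_slice = df.loc["b":"c"]   # label slice: both endpoints are included

print(row_slice)
print(label_slice)
```

Note that an integer slice excludes its end position, while a label slice with `loc` includes the end label.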

## Iterating over dataframe

In [None]:
import pandas as pd

d = {
    'Name':pd.Series(['Tom','James','Ricky','Vin','Steve']),
    'Age':pd.Series([25,26,25,23,30]),
    'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20])
     }
df = pd.DataFrame(d)
print (df)

In [None]:
for index, row in df.iterrows():
    print(f"Index: {index}, Name: {row['Name']}, Age: {row['Age']}, Rating: {row['Rating']}")

In [None]:
for row in df.itertuples():
    print(f"Index: {row.Index}, Name: {row.Name}, Age: {row.Age}, Rating: {row.Rating}")

In [None]:
df.columns

In [None]:
df.values

In [None]:
for col in df.columns:
    #print(f"Column: {col}")
    print(df[col])

In [None]:
for row in df.values:
    print(f"Name: {row[0]}, Age: {row[1]}, Rating: {row[2]}")

In [None]:

for key, value in df.items():
    print(f"Column: {key}, Series: {value}")
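
Explicit iteration is fine for printing or inspection, but for computing values a column-wise operation is usually shorter and faster. A small sketch, assuming a cut-down version of the table above:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Tom", "James", "Ricky"],
    "Age": [25, 26, 25],
    "Rating": [4.23, 3.24, 3.98],
})

# Row by row: build the result in a Python loop
looped = [row.Age + 1 for row in df.itertuples()]

# Vectorized: one operation on the whole column
vectorized = df["Age"] + 1

print(looped)
print(vectorized.tolist())
```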

### 5. Working on Dataframes

In [2]:
import pandas as pd
import numpy as np
raw_data = {
        'first_name': ['Jason', np.nan, 'Tina', 'Jake', 'Amy'],
        'last_name': ['Miller', np.nan, 'Ali', 'Milner', 'Cooze'],
        'age': [42, np.nan, 36, 24, 73],
        'sex': ['m', np.nan, 'f', 'm', 'f'],
        'preTestScore': [4, np.nan, np.nan, 2, 3],
        'postTestScore': [25, np.nan, np.nan, 62, 70]
        }

df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'sex', 'preTestScore', 'postTestScore'])
print (df)

  first_name last_name   age  sex  preTestScore  postTestScore
0      Jason    Miller  42.0    m           4.0           25.0
1        NaN       NaN   NaN  NaN           NaN            NaN
2       Tina       Ali  36.0    f           NaN            NaN
3       Jake    Milner  24.0    m           2.0           62.0
4        Amy     Cooze  73.0    f           3.0           70.0


### Drop missing observations

In [3]:
df_no_missing = df.dropna()
print (df_no_missing)

  first_name last_name   age sex  preTestScore  postTestScore
0      Jason    Miller  42.0   m           4.0           25.0
3       Jake    Milner  24.0   m           2.0           62.0
4        Amy     Cooze  73.0   f           3.0           70.0


### Drop rows where all cells in that row are NA

In [4]:
df_cleaned = df.dropna(how='all').copy()  # explicit copy, so later column assignments modify this frame only

print(df_cleaned)

  first_name last_name   age sex  preTestScore  postTestScore
0      Jason    Miller  42.0   m           4.0           25.0
2       Tina       Ali  36.0   f           NaN            NaN
3       Jake    Milner  24.0   m           2.0           62.0
4        Amy     Cooze  73.0   f           3.0           70.0


### Drop rows that contain fewer than five observations

In [8]:
print(df.dropna(thresh=5))

  first_name last_name   age sex  preTestScore  postTestScore
0      Jason    Miller  42.0   m           4.0           25.0
3       Jake    Milner  24.0   m           2.0           62.0
4        Amy     Cooze  73.0   f           3.0           70.0
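
The `thresh` argument is the minimum number of non-missing values a row must have to be kept. A minimal sketch on a reduced, three-column version of the data, which also shows the related `subset` argument:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "first_name": ["Jason", np.nan, "Tina"],
    "age": [42, np.nan, 36],
    "preTestScore": [4, np.nan, np.nan],
})

# Count the non-missing values in each row
non_missing = df.notna().sum(axis=1)

# Keep rows with at least 3 non-missing values
kept = df.dropna(thresh=3)

# Alternatively, only look at selected columns when deciding what to drop
by_subset = df.dropna(subset=["age"])

print(non_missing)
print(kept)
print(by_subset)
```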


#### Fill in missing data with zeros

In [10]:
print(df.fillna(0))

  first_name last_name   age sex  preTestScore  postTestScore
0      Jason    Miller  42.0   m           4.0           25.0
1          0         0   0.0   0           0.0            0.0
2       Tina       Ali  36.0   f           0.0            0.0
3       Jake    Milner  24.0   m           2.0           62.0
4        Amy     Cooze  73.0   f           3.0           70.0


### Fill in missing values in preTestScore with the mean value of preTestScore

In [13]:
import warnings
warnings.filterwarnings('ignore')

In [11]:
pre_mean = df_cleaned['preTestScore'].mean()
pre_mean

3.0

In [14]:
df_cleaned['preTestScore'] = df_cleaned['preTestScore'].fillna(pre_mean)
print(df_cleaned)

  first_name last_name   age sex  preTestScore  postTestScore
0      Jason    Miller  42.0   m           4.0           25.0
2       Tina       Ali  36.0   f           3.0            NaN
3       Jake    Milner  24.0   m           2.0           62.0
4        Amy     Cooze  73.0   f           3.0           70.0


#### Fill in missing values in postTestScore with the median value of postTestScore

In [15]:
post_median = df_cleaned['postTestScore'].median()
df_cleaned['postTestScore'] = df_cleaned['postTestScore'].fillna(post_median)

print(df_cleaned)

  first_name last_name   age sex  preTestScore  postTestScore
0      Jason    Miller  42.0   m           4.0           25.0
2       Tina       Ali  36.0   f           3.0           62.0
3       Jake    Milner  24.0   m           2.0           62.0
4        Amy     Cooze  73.0   f           3.0           70.0
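
Both fills can also be done in one call by passing `fillna` a dictionary that maps column names to fill values. A minimal sketch, using only the two score columns with the same values as above:

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({
    "preTestScore": [4, np.nan, 2, 3],
    "postTestScore": [25, np.nan, 62, 70],
})

# One fill value per column: mean for pre-test, median for post-test
fill_values = {
    "preTestScore": scores["preTestScore"].mean(),
    "postTestScore": scores["postTestScore"].median(),
}

# fillna returns a new DataFrame; the original is left unchanged
filled = scores.fillna(fill_values)

print(filled)
```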
