* loc gets rows (and/or columns) with particular labels.
* iloc gets rows (and/or columns) at integer locations.
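
As a minimal sketch of the difference between the two indexers, the example below uses a small made-up DataFrame (the `scores` data and its labels are assumptions for illustration, not part of the original notebook):

```python
import pandas as pd

# Hypothetical example data with string row labels.
scores = pd.DataFrame(
    {"math": [90, 75, 60], "art": [85, 95, 70]},
    index=["ann", "bob", "cy"],
)

by_label = scores.loc["bob", "art"]    # row label "bob", column label "art"
by_position = scores.iloc[1, 1]        # second row, second column
label_slice = scores.loc["ann":"bob"]  # label slices include the end label
position_slice = scores.iloc[0:1]      # integer slices exclude the end position

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```

Note that slicing with loc includes both endpoints, while slicing with iloc follows the usual Python half-open convention.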

* For data with float64 dtype, pandas uses the floating-point value NaN (Not a Number) to represent missing data.

* The isna method gives us a Boolean Series with True where values are null.

In [None]:
import numpy as np
import pandas as pd

data=pd.Series([1.2,3,4,-6,np.nan])
data

0    1.2
1    3.0
2    4.0
3   -6.0
4    NaN
dtype: float64

In [None]:
data.isna()

0    False
1    False
2    False
3    False
4     True
dtype: bool

The built-in Python None value is also treated as NA; in a float64 Series it is stored as NaN.

In [None]:
data=pd.Series([1,2,2.4,None,np.nan,5])
data

0    1.0
1    2.0
2    2.4
3    NaN
4    NaN
5    5.0
dtype: float64

In [None]:
data.isna()

0    False
1    False
2    False
3     True
4     True
5    False
dtype: bool

**NA handling object methods**

**dropna** Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.

**fillna** Fill in missing data with some value or using an interpolation method such as "ffill" or "bfill".

**isna** Return Boolean values indicating which values are missing/NA.

**notna** Negation of isna, returns True for non-NA values and False for NA values.


In [None]:
data=pd.Series([np.nan,2,5,4.5,np.nan])
data

0    NaN
1    2.0
2    5.0
3    4.5
4    NaN
dtype: float64

In [None]:
#dropna returns a new Series containing only the non-null values and their index labels:
data.dropna()

1    2.0
2    5.0
3    4.5
dtype: float64

In [None]:
data

0    NaN
1    2.0
2    5.0
3    4.5
4    NaN
dtype: float64

In [None]:
data.notna()

0    False
1     True
2     True
3     True
4    False
dtype: bool

In [None]:
#Boolean indexing with notna() gives the same result as dropna():
data[data.notna()]

1    2.0
2    5.0
3    4.5
dtype: float64

In [None]:
data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]],columns=list('ABC'))

data


Unnamed: 0,A,B,C
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [None]:
data.notna()

Unnamed: 0,A,B,C
0,True,True,True
1,True,False,False
2,False,False,False
3,False,True,True


In [None]:
data.isna()

Unnamed: 0,A,B,C
0,False,False,False
1,False,True,True
2,True,True,True
3,True,False,False


In [None]:
#By default, dropna drops every row that contains at least one NA value
data.dropna()

Unnamed: 0,A,B,C
0,1.0,6.5,3.0


In [None]:
#how="all" will drop only rows that are all NA
data.dropna(how='all')

Unnamed: 0,A,B,C
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


In [None]:
data['C']=np.nan

In [None]:
data

Unnamed: 0,A,B,C
0,1.0,6.5,
1,1.0,,
2,,,
3,,6.5,


In [None]:
data.dropna(how='all',axis=1)

Unnamed: 0,A,B
0,1.0,6.5
1,1.0,
2,,
3,,6.5


In [None]:
#axis='columns' is equivalent to axis=1: drop only columns that are all NA
data.dropna(how='all',axis='columns')

Unnamed: 0,A,B
0,1.0,6.5
1,1.0,
2,,
3,,6.5


In [None]:
data

Unnamed: 0,A,B,C
0,1.0,6.5,
1,1.0,,
2,,,
3,,6.5,


In [None]:
#To keep only rows with enough observed values, use the thresh argument.
#thresh=2 keeps rows that have at least 2 non-NA values and drops the rest.
data.dropna(thresh=2)

Unnamed: 0,A,B,C
0,1.0,6.5,
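
To make the meaning of thresh concrete, here is a small sketch that counts the non-NA values per row and compares a manual filter with dropna(thresh=2). The `frame` data is a made-up example chosen for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with 2, 1 and 0 observed values per row.
frame = pd.DataFrame(
    [[1.0, 6.5, np.nan], [1.0, np.nan, np.nan], [np.nan, np.nan, np.nan]],
    columns=list("ABC"),
)

non_na_counts = frame.notna().sum(axis=1)  # observed values in each row
kept = frame.dropna(thresh=2)              # rows with at least 2 non-NA values
manual = frame[non_na_counts >= 2]         # the same filter written by hand

print(non_na_counts.tolist())
print(kept)
```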


### **Filling In Missing Data**

* Rather than filtering out missing data, sometimes we want to fill in the missing values, which can be done in a number of ways.
* Calling fillna with a constant replaces missing values with that value:

In [None]:
data

Unnamed: 0,A,B,C
0,1.0,6.5,
1,1.0,,
2,,,
3,,6.5,


In [None]:
data.fillna(0)

Unnamed: 0,A,B,C
0,1.0,6.5,0.0
1,1.0,0.0,0.0
2,0.0,0.0,0.0
3,0.0,6.5,0.0


In [None]:
#Calling fillna with a dictionary, you can use a different fill value for each column
data.fillna({'A': 0.5, 'C': 1})

Unnamed: 0,A,B,C
0,1.0,6.5,1.0
1,1.0,,1.0
2,0.5,,1.0
3,0.5,6.5,1.0


In [None]:
data.loc[0,'C']=2.5
#add an all-NA row labeled 'C'; it appears in the outputs that follow
data.loc['C']=np.nan

In [None]:
data

Unnamed: 0,A,B,C
0,1.0,6.5,2.5
1,1.0,,
2,,,
3,,6.5,
C,,,


In [None]:
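#Forward fill: carry the last valid value down each column (pandas 2.1+ deprecates method= in favor of data.ffill())
data.fillna(method='ffill')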

Unnamed: 0,A,B,C
0,1.0,6.5,2.5
1,1.0,6.5,2.5
2,1.0,6.5,2.5
3,1.0,6.5,2.5
C,1.0,6.5,2.5


In [None]:
#limit=2 fills at most two consecutive missing values in each column
data.fillna(method='ffill',limit=2)

Unnamed: 0,A,B,C
0,1.0,6.5,2.5
1,1.0,6.5,2.5
2,1.0,6.5,2.5
3,1.0,6.5,
C,,6.5,
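
The same filling behavior is available through the dedicated ffill and bfill methods. The sketch below uses a made-up Series (`s` is an assumption for illustration) to show forward fill, backward fill, and the effect of limit:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with a run of two missing values and a trailing one.
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()          # carry the last valid value forward
backward = s.bfill()         # pull the next valid value backward
limited = s.ffill(limit=1)   # fill at most one consecutive missing value

print(forward.tolist())
print(backward.tolist())
print(limited.tolist())
```

A trailing missing value has nothing after it to copy from, so bfill leaves it as NaN.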


In [None]:
#Fill each column's missing values with that column's mean
data.fillna(data.mean())

Unnamed: 0,A,B,C
0,1.0,6.5,2.5
1,1.0,6.5,2.5
2,1.0,6.5,2.5
3,1.0,6.5,2.5
C,1.0,6.5,2.5


In [None]:
#Fill each column's missing values with that column's median
data.fillna(data.median())

Unnamed: 0,A,B,C
0,1.0,6.5,2.5
1,1.0,6.5,2.5
2,1.0,6.5,2.5
3,1.0,6.5,2.5
C,1.0,6.5,2.5


In [None]:
data = pd.Series([1., np.nan, 3.5, np.nan, 7])
print(data)

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64


In [None]:
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64

In [None]:
data.fillna(data.median())

0    1.0
1    3.5
2    3.5
3    3.5
4    7.0
dtype: float64

In [None]:
data = pd.DataFrame({"k1": ["one", "two"] * 3 + ["two"],"k2": [1, 1, 2, 3, 3, 4, 4]})

In [None]:
data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


In [None]:
#The duplicated method returns a Boolean Series indicating whether each value (here, in column k2) has been observed in an earlier row
data['k2'].duplicated()

0    False
1     True
2    False
3    False
4     True
5    False
6     True
Name: k2, dtype: bool

In [None]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

In [None]:
#drop_duplicates returns a DataFrame that keeps only the rows where duplicated() is False:
data.drop_duplicates()

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4


In [None]:
#By default all columns are considered; you can specify a subset of columns to use when detecting duplicates.
data.drop_duplicates(subset=['k1'])

Unnamed: 0,k1,k2
0,one,1
1,two,1


In [None]:
#duplicated and drop_duplicates keep the first observed value combination by default. Passing keep="last" keeps the last one instead
data.drop_duplicates(subset=['k1'],keep='last')

Unnamed: 0,k1,k2
4,one,3
6,two,4
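
Besides "first" and "last", the keep argument also accepts False, which treats every member of a duplicated group as a duplicate. A short sketch on the same k1/k2 data (the variable names `all_dupes` and `unique_only` are made up for this example):

```python
import pandas as pd

data = pd.DataFrame({"k1": ["one", "two"] * 3 + ["two"],
                     "k2": [1, 1, 2, 3, 3, 4, 4]})

# keep=False marks every row that has a twin, not just the later copies.
all_dupes = data.duplicated(keep=False)

# With keep=False, drop_duplicates removes all rows of a duplicated group.
unique_only = data.drop_duplicates(keep=False)

print(all_dupes.tolist())
print(unique_only)
```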


In [None]:
data = pd.DataFrame({"food": ["bacon", "pulled pork", "bacon","pastrami", "corned beef"],"ounces": [4, 3, 12, 6, 7.5]})
data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,pastrami,6.0
4,corned beef,7.5


In [None]:
meat_to_animal = {
 "bacon": "pig",
 "pulled pork": "pig",
 "pastrami": "cow",
 "corned beef": "cow"
}


In [None]:
#map accepts a dict-like object (or a function) and returns the mapped value for each element
data['animals']=data['food'].map(meat_to_animal)

In [None]:
#replace substitutes values and returns a new Series; here 4.0 becomes NaN, and data itself is unchanged
type(data['ounces'].replace(4.0,np.nan))

pandas.core.series.Series

In [None]:
data

Unnamed: 0,food,ounces,animals
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,pastrami,6.0,cow
4,corned beef,7.5,cow


In [None]:
#rename is used to rename column or index labels; here every column name is upper-cased.
data.rename(columns=str.upper)

Unnamed: 0,FOOD,OUNCES,ANIMALS
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,pastrami,6.0,cow
4,corned beef,7.5,cow
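
When mapping values through a dictionary, as with meat_to_animal earlier, two details are worth keeping in mind: keys must match exactly, and values without a matching key become NaN. The sketch below illustrates both with a made-up `foods` Series (the mixed capitalization and the "tofu" entry are assumptions for illustration):

```python
import pandas as pd

meat_to_animal = {
    "bacon": "pig",
    "pulled pork": "pig",
    "pastrami": "cow",
    "corned beef": "cow",
}

# Hypothetical input with inconsistent capitalization and an unmapped value.
foods = pd.Series(["Bacon", "pastrami", "tofu"])

# Normalize the case first so "Bacon" matches the "bacon" key.
animals = foods.str.lower().map(meat_to_animal)

print(animals)
```

The unmapped "tofu" entry comes back as NaN, so it can be handled afterwards with fillna or dropna.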


In [None]:
ages = [20,21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

In [None]:
#Continuous data is often discretized or otherwise separated into “bins” for analysis.
age_cat=pd.cut(ages,bins)
print(age_cat)

[(18, 25], (18, 25], (18, 25], (35, 60], (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Categories (4, interval[int64, right]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]


In [None]:
age_cat.codes

array([0, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [None]:
cat=age_cat.categories
cat

IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]], dtype='interval[int64, right]')

In [None]:
age_cat.categories[0]

Interval(18, 25, closed='right')

In [None]:
#pd.value_counts(age_cat) gives the number of values that fall into each bin produced by pandas.cut.
pd.value_counts(age_cat)

(18, 25]     3
(35, 60]     3
(25, 35]     2
(60, 100]    1
dtype: int64

In the string representation of an interval, a parenthesis means that the side is open (exclusive), while a square bracket means it is closed (inclusive). You can change which side is closed by passing right=False.

In [None]:
pd.cut(ages, bins, right=False)


[[18, 25), [18, 25), [18, 25), [35, 60), [25, 35), [60, 100), [35, 60), [35, 60), [25, 35)]
Categories (4, interval[int64, left]): [[18, 25) < [25, 35) < [35, 60) < [60, 100)]
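
pandas.cut also accepts a labels argument, which replaces the interval notation with custom bin names. A short sketch on the same ages and bins (the names in `group_names` are assumptions chosen for illustration):

```python
import pandas as pd

ages = [20, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

# Hypothetical names, one per bin, in the same order as the bins.
group_names = ["Youth", "YoungAdult", "MiddleAged", "Senior"]

labeled = pd.cut(ages, bins, labels=group_names)
counts = pd.Series(labeled).value_counts()

print(labeled)
print(counts)
```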

pandas.qcut bins the data based on sample quantiles. Depending on the distribution of the data, using pandas.cut will not usually result in each bin having the same number of data points. Since pandas.qcut uses sample quantiles instead, you will obtain roughly equally sized bins.

In [None]:
data = np.random.standard_normal(100)
quantile=pd.qcut(data,5,precision=2)

In [None]:
quantile

[(0.61, 2.18], (-0.86, -0.31], (0.61, 2.18], (-0.31, 0.095], (0.61, 2.18], ..., (0.095, 0.61], (-2.8899999999999997, -0.86], (-0.86, -0.31], (0.61, 2.18], (-0.86, -0.31]]
Length: 100
Categories (5, interval[float64, right]): [(-2.8899999999999997, -0.86] < (-0.86, -0.31] < (-0.31, 0.095] <
                                           (0.095, 0.61] < (0.61, 2.18]]
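
One way to check the "roughly equally sized bins" claim is to count the values per quantile bin. The sketch below uses a seeded generator so the run is repeatable (the seed and the choice of four bins are assumptions for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)       # seeded for repeatability
values = rng.standard_normal(100)

quartiles = pd.qcut(values, 4)       # four quantile-based bins
counts = pd.Series(quartiles).value_counts()

print(counts)
```

With 100 distinct values and four bins, each bin receives 25 values.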

In [None]:
import numpy as np
import pandas as pd

df = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"],"data1": range(6)})


In [None]:
df

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,b,5


In [None]:
pd.get_dummies(df['key'])

Unnamed: 0,a,b,c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0
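
In practice the indicator columns are often given a prefix and joined back to the remaining data. A minimal sketch, assuming the same df as above (the prefix "key" and the name `df_with_dummy` are choices made for this example):

```python
import pandas as pd

df = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"], "data1": range(6)})

# prefix="key" produces columns key_a, key_b, key_c.
dummies = pd.get_dummies(df["key"], prefix="key")

# Join the indicator columns to the other data column.
df_with_dummy = df[["data1"]].join(dummies)

print(df_with_dummy)
```

Depending on the pandas version, the indicator columns are displayed as 0/1 or as False/True.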


In [None]:
s = pd.Series([1, 2, 3, None], dtype="Int64")

In [None]:
s

0       1
1       2
2       3
3    <NA>
dtype: Int64

In [None]:
s = pd.Series(['one', 'two', None, 'three'], dtype=pd.StringDtype())
s

0      one
1      two
2     <NA>
3    three
dtype: string

In [None]:
#Extension types can be passed to the Series astype method, allowing you to convert easily as part of your data cleaning process.

df = pd.DataFrame({"A": [1, 2, None, 4],"B": ["one", "two", "three", None],"C": [False, None, False, True]})
df



Unnamed: 0,A,B,C
0,1.0,one,False
1,2.0,two,
2,,three,False
3,4.0,,True


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       3 non-null      float64
 1   B       3 non-null      object 
 2   C       3 non-null      object 
dtypes: float64(1), object(2)
memory usage: 224.0+ bytes


In [None]:
df['A']=df['A'].astype('Int64')

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   A       3 non-null      Int64 
 1   B       3 non-null      object
 2   C       3 non-null      object
dtypes: Int64(1), object(2)
memory usage: 228.0+ bytes
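
The other two columns can be converted in the same way. The sketch below rebuilds the frame and converts all three columns to extension types (the name `converted` is an assumption for this example):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, None, 4],
                   "B": ["one", "two", "three", None],
                   "C": [False, None, False, True]})

converted = df.copy()
converted["A"] = converted["A"].astype("Int64")    # nullable integer
converted["B"] = converted["B"].astype("string")   # nullable string
converted["C"] = converted["C"].astype("boolean")  # nullable Boolean

print(converted.dtypes)
print(converted)
```

After conversion, the missing entries in each column are displayed as <NA> rather than NaN or None.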


### **String Manipulation**


In [None]:
val = "a,b, guido"
val.split(',')

['a', 'b', ' guido']

In [None]:
for x in val.split(","):
  print(str.strip(x))

a
b
guido


In [None]:
b=[x.strip() for x in val.split(",")]
print(b)

['a', 'b', 'guido']


In [None]:
first,second,third=b

In [None]:
print(first,second,third)

a b guido


In [None]:
first+"::"+second+"::"+third

'a::b::guido'

In [None]:
'guido' in b

True

In [None]:
val.find(':')

-1
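
A few more built-in string methods are useful alongside split and strip. The sketch below joins the pieces with str.join instead of repeated concatenation, and shows find, count, and replace on the same val string (the variable names are made up for this example):

```python
val = "a,b, guido"
pieces = [x.strip() for x in val.split(",")]

joined = "::".join(pieces)          # join a list of strings with a delimiter
position = val.find(",")            # index of the first comma, or -1 if absent
comma_count = val.count(",")        # number of non-overlapping occurrences
replaced = val.replace(",", "::")   # substitute one pattern for another

print(joined)
print(position, comma_count)
print(replaced)
```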

In [None]:
import pandas as pd
values = pd.Series(['apple', 'orange', 'apple','apple'] * 2)
values

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
dtype: object

In [None]:
values.unique()

array(['apple', 'orange'], dtype=object)

In [None]:
pd.unique(values)

array(['apple', 'orange'], dtype=object)

In [None]:
pd.value_counts(values)

apple     6
orange    2
dtype: int64

Many data systems (for data warehousing, statistical computing, or other uses) have developed specialized approaches for representing data with repeated values for more efficient storage and computation. In data warehousing, a best practice is to use so-called dimension tables that contain the distinct values, and to store the primary observations as integer keys referencing the dimension table.

In [None]:
#We can use the take method to restore the original Series of strings:
values = pd.Series([0, 1, 0, 0] * 2)
dim = pd.Series(['apple', 'orange'])
dim.take(values)

0     apple
1    orange
0     apple
0     apple
0     apple
1    orange
0     apple
0     apple
dtype: object
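
The category dtype stores data in this same codes-plus-distinct-values layout. As a sketch, the example below converts a Series of strings to category, inspects the integer codes and the categories, and rebuilds the original strings from them (the variable names are assumptions for illustration):

```python
import pandas as pd

fruits = pd.Series(["apple", "orange", "apple", "apple"] * 2)
as_cat = fruits.astype("category")

codes = as_cat.cat.codes             # integer key for each observation
categories = as_cat.cat.categories   # the distinct values (the "dimension table")

# Look the codes up in the categories to recover the original strings.
restored = pd.Series(categories.take(codes.to_numpy()))

print(codes.tolist())
print(list(categories))
```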

#### Categorical types can improve performance and memory use, so let’s look at some examples.

In [None]:
n=100000
label=pd.Series(['a','b','c']*(n//4))

In [None]:
cat=label.astype('category')

In [None]:
label.memory_usage(deep=True)

4350128

In [None]:
cat.memory_usage(deep=True)

75410

#### GroupBy operations can be significantly faster with categoricals because the underlying algorithms use the integer-based codes array instead of an array of strings. Here we compare the performance of value_counts(), which internally uses the GroupBy machinery:


In [None]:
import time

In [None]:
%timeit label.value_counts()

6.39 ms ± 1.94 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [None]:
%timeit cat.value_counts()


732 µs ± 125 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


#### We can use the remove_unused_categories method to trim categories that no longer appear in the data, for example after filtering.
        Series_obj.cat.remove_unused_categories()
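
As a minimal sketch of when this matters, the example below filters a categorical Series and then trims the categories that no longer occur (the data and the names `subset` and `trimmed` are assumptions for illustration):

```python
import pandas as pd

letters = pd.Series(["a", "b", "c", "d"] * 2).astype("category")

# Filtering keeps all four categories, even though only two still occur.
subset = letters[letters.isin(["a", "b"])]

# remove_unused_categories drops the categories with no remaining values.
trimmed = subset.cat.remove_unused_categories()

print(list(subset.cat.categories))
print(list(trimmed.cat.categories))
```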

#### **Categorical methods for Series in pandas**

* **add_categories** Append new (unused) categories at end of existing categories
* **as_ordered** Make categories ordered
* **as_unordered** Make categories unordered
* **remove_categories** Remove categories, setting any removed values to null
* **remove_unused_categories** Remove any category values that do not appear in the data
* **rename_categories** Replace categories with indicated set of new category names; cannot change the number of categories
* **reorder_categories** Reorder the categories as specified; the new list must contain exactly the existing categories, and the result can optionally be made ordered
* **set_categories** Replace the categories with the indicated set of new categories; can add or remove categories
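
A short sketch of three of these methods in use follows. The `sizes` data and the category names are made up for illustration:

```python
import pandas as pd

sizes = pd.Series(["small", "large", "small"]).astype("category")

# set_categories can introduce categories that do not occur in the data yet.
widened = sizes.cat.set_categories(["small", "medium", "large"])

# reorder_categories needs exactly the existing categories; ordered=True
# makes comparisons such as max() meaningful.
ordered = widened.cat.reorder_categories(["small", "medium", "large"], ordered=True)

# rename_categories changes the labels but not the number of categories.
renamed = sizes.cat.rename_categories({"small": "S", "large": "L"})

print(list(widened.cat.categories))
print(ordered.max())
print(renamed.tolist())
```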


In [None]:

import pandas as pd
import numpy as np

In [None]:
frame = pd.DataFrame({"a": range(7), "b": range(7, 0, -1),"c": ["one", "one", "one", "two", "two","two", "two"],"d": [0, 1, 2, 0, 1, 2, 3]})


In [None]:
frame

Unnamed: 0,a,b,c,d
0,0,7,one,0
1,1,6,one,1
2,2,5,one,2
3,3,4,two,0
4,4,3,two,1
5,5,2,two,2
6,6,1,two,3


In [None]:
#DataFrame’s set_index function will create a new DataFrame using one or more of its columns as the index:
frame.set_index(['c','d'])

c,d,a,b
one,0,0,7
one,1,1,6
one,2,2,5
two,0,3,4
two,1,4,3
two,2,5,2
two,3,6,1


In [None]:
#By default, the columns are removed from the DataFrame, though you can leave them in by passing drop=False to set_index
frame.set_index(['c','d'],drop=False)

c,d,a,b,c,d
one,0,0,7,one,0
one,1,1,6,one,1
one,2,2,5,one,2
two,0,3,4,two,0
two,1,4,3,two,1
two,2,5,2,two,2
two,3,6,1,two,3


In [None]:
#reset_index does the opposite of set_index: the index levels are moved into the columns.
#set_index returned a new DataFrame, so frame still has its default index, which becomes an "index" column here:
frame.reset_index()

Unnamed: 0,index,a,b,c,d
0,0,0,7,one,0
1,1,1,6,one,1
2,2,2,5,one,2
3,3,3,4,two,0
4,4,4,3,two,1
5,5,5,2,two,2
6,6,6,1,two,3
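
To see reset_index undo a hierarchical index, the result of set_index has to be kept first. A minimal sketch of the round trip on the same frame (the names `indexed` and `restored` are assumptions for this example):

```python
import pandas as pd

frame = pd.DataFrame({"a": range(7), "b": range(7, 0, -1),
                      "c": ["one", "one", "one", "two", "two", "two", "two"],
                      "d": [0, 1, 2, 0, 1, 2, 3]})

indexed = frame.set_index(["c", "d"])   # c and d become a two-level index
restored = indexed.reset_index()        # the index levels move back into columns

print(indexed.index.nlevels)
print(list(restored.columns))
```

The former index levels come back as the leading columns, so the column order differs from the original frame even though the data is the same.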


#### There are a number of basic operations for rearranging tabular data. These are referred to as reshape or pivot operations.

#### Hierarchical indexing provides a consistent way to rearrange data in a DataFrame. There are two primary actions:

* **stack** - This “rotates” or pivots from the columns in the data to the rows.
* **unstack** - This pivots from the rows into the columns.



In [None]:
import pandas as pd
import numpy as np
data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(["Ohio", "Colorado"], name="state"),
                    columns=pd.Index(["one", "two", "three"],name="number"))

data


number,one,two,three
state,,,
Ohio,0,1,2
Colorado,3,4,5


In [None]:
data.stack()

state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int64

In [None]:
data.unstack()

number  state   
one     Ohio        0
        Colorado    3
two     Ohio        1
        Colorado    4
three   Ohio        2
        Colorado    5
dtype: int64

In [None]:
result=data.stack()
result.unstack()

number,one,two,three
state,,,
Ohio,0,1,2
Colorado,3,4,5
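
By default unstack pivots the innermost index level. A different level can be chosen by passing its number or name, as in this sketch on the same data (the name `by_state` is an assumption for this example):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(["Ohio", "Colorado"], name="state"),
                    columns=pd.Index(["one", "two", "three"], name="number"))

result = data.stack()

# Pivot the "state" level into the columns instead of the innermost level.
by_state = result.unstack(level="state")

print(by_state)
```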


In [None]:
#An inverse operation to pivot for DataFrames is pandas.melt.
df = pd.DataFrame({"key": ["foo", "bar", "baz"],"A": [1, 2, 3],"B": [4, 5, 6],"C": [7, 8, 9]})
df


Unnamed: 0,key,A,B,C
0,foo,1,4,7
1,bar,2,5,8
2,baz,3,6,9


In [None]:
#The "key" column may be a group indicator, and the other columns are data values.
#When using pandas.melt, we must indicate which columns (if any) are group indicators.
res=pd.melt(df,id_vars='key')
print(res)

   key variable  value
0  foo        A      1
1  bar        A      2
2  baz        A      3
3  foo        B      4
4  bar        B      5
5  baz        B      6
6  foo        C      7
7  bar        C      8
8  baz        C      9


In [None]:
#Using pivot, we can reshape back to the original layout:
res.pivot(index='key',columns='variable',values='value')

variable,A,B,C
key,,,
bar,2,5,8
baz,3,6,9
foo,1,4,7
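
melt can also be restricted to a subset of value columns, and reset_index moves the pivoted key back into a regular column. A minimal sketch, assuming the same df as above (the names `melted`, `reshaped`, and `flat` are made up for this example):

```python
import pandas as pd

df = pd.DataFrame({"key": ["foo", "bar", "baz"],
                   "A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

# value_vars limits which columns are melted; column C is left out here.
melted = pd.melt(df, id_vars="key", value_vars=["A", "B"])

# pivot uses "key" as the row index; reset_index turns it back into a column.
reshaped = melted.pivot(index="key", columns="variable", values="value")
flat = reshaped.reset_index()

print(melted)
print(flat)
```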


##### Some Interview-Based Questions Related to pandas

In [None]:
#import library
import pandas as pd


In [None]:
##Display the Version
pd.__version__

'1.5.3'

In [2]:
##Create DataFrame
import pandas as pd
import numpy as np
df=pd.DataFrame(np.random.standard_normal((4,6)))

In [3]:
pd.show_versions()




INSTALLED VERSIONS
------------------
commit           : 2e218d10984e9919f0296931d92ea851c6a6faf5
python           : 3.10.11.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.15.107+
Version          : #1 SMP Sat Apr 29 09:15:28 UTC 2023
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.5.3
numpy            : 1.22.4
pytz             : 2022.7.1
dateutil         : 2.8.2
setuptools       : 67.7.2
pip              : 23.1.2
Cython           : 0.29.34
pytest           : 7.2.2
hypothesis       : None
sphinx           : 3.5.4
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.9.2
html5lib         : 1.1
pymysql          : None
psycopg2         : 2.9.6
jinja2           : 3.1.2
IPython          : 7.34.0
pandas_datareader: 0.10.0
bs4              : 4.11.2
bottleneck       : None
brotli           

In [4]:
#Rename the columns by assigning a new list of names
df.columns=[
  'A','B','C','D','E','F'
]

In [5]:
df

Unnamed: 0,A,B,C,D,E,F
0,-0.557178,0.623962,0.318422,0.844767,2.274763,1.591667
1,-0.057118,0.488344,-0.480903,0.121318,0.542104,-0.121787
2,-0.154115,0.933997,0.076008,0.56533,-0.165408,-1.053167
3,2.372329,0.764797,-0.343273,1.65535,0.703059,-1.358635


In [6]:
#Rename the columns using a mapping from old names to new names
df=df.rename({'A':'cols_a','B':'cols_b','C':'cols_c','D':'cols_d','E':'cols_e','F':'cols_f'},axis='columns')

In [7]:
df

Unnamed: 0,cols_a,cols_b,cols_c,cols_d,cols_e,cols_f
0,-0.557178,0.623962,0.318422,0.844767,2.274763,1.591667
1,-0.057118,0.488344,-0.480903,0.121318,0.542104,-0.121787
2,-0.154115,0.933997,0.076008,0.56533,-0.165408,-1.053167
3,2.372329,0.764797,-0.343273,1.65535,0.703059,-1.358635


In [8]:
#Reverse the row order and reset the index
df.loc[::-1].reset_index(drop=True)

Unnamed: 0,cols_a,cols_b,cols_c,cols_d,cols_e,cols_f
0,2.372329,0.764797,-0.343273,1.65535,0.703059,-1.358635
1,-0.154115,0.933997,0.076008,0.56533,-0.165408,-1.053167
2,-0.057118,0.488344,-0.480903,0.121318,0.542104,-0.121787
3,-0.557178,0.623962,0.318422,0.844767,2.274763,1.591667


In [9]:
#Reverse the column order
df.loc[:,::-1]

Unnamed: 0,cols_f,cols_e,cols_d,cols_c,cols_b,cols_a
0,1.591667,2.274763,0.844767,0.318422,0.623962,-0.557178
1,-0.121787,0.542104,0.121318,-0.480903,0.488344,-0.057118
2,-1.053167,-0.165408,0.56533,0.076008,0.933997,-0.154115
3,-1.358635,0.703059,1.65535,-0.343273,0.764797,2.372329


In [10]:
#Add a column to the DataFrame with 'Missing' as the value in every row
df['cols_str']='Missing'

In [11]:
#Check info about the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   cols_a    4 non-null      float64
 1   cols_b    4 non-null      float64
 2   cols_c    4 non-null      float64
 3   cols_d    4 non-null      float64
 4   cols_e    4 non-null      float64
 5   cols_f    4 non-null      float64
 6   cols_str  4 non-null      object 
dtypes: float64(6), object(1)
memory usage: 352.0+ bytes


In [12]:
#Select columns by data type
df.select_dtypes(include='object')

Unnamed: 0,cols_str
0,Missing
1,Missing
2,Missing
3,Missing


In [13]:
df.select_dtypes(include='float64')

Unnamed: 0,cols_a,cols_b,cols_c,cols_d,cols_e,cols_f
0,-0.557178,0.623962,0.318422,0.844767,2.274763,1.591667
1,-0.057118,0.488344,-0.480903,0.121318,0.542104,-0.121787
2,-0.154115,0.933997,0.076008,0.56533,-0.165408,-1.053167
3,2.372329,0.764797,-0.343273,1.65535,0.703059,-1.358635


In [14]:
#include='number' selects all numeric columns; here that gives the same result as include='float64' above
df.select_dtypes(include='number')

Unnamed: 0,cols_a,cols_b,cols_c,cols_d,cols_e,cols_f
0,-0.557178,0.623962,0.318422,0.844767,2.274763,1.591667
1,-0.057118,0.488344,-0.480903,0.121318,0.542104,-0.121787
2,-0.154115,0.933997,0.076008,0.56533,-0.165408,-1.053167
3,2.372329,0.764797,-0.343273,1.65535,0.703059,-1.358635


In [16]:
#Change the data type of a column
df['cols_b']=df['cols_b'].astype('str')

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   cols_a    4 non-null      float64
 1   cols_b    4 non-null      object 
 2   cols_c    4 non-null      float64
 3   cols_d    4 non-null      float64
 4   cols_e    4 non-null      float64
 5   cols_f    4 non-null      float64
 6   cols_str  4 non-null      object 
dtypes: float64(5), object(2)
memory usage: 352.0+ bytes


In [19]:
#Convert cols_b back to float and view the resulting dtypes (astype returns a new DataFrame; df is unchanged)
df.astype({'cols_b':'float'}).dtypes

cols_a      float64
cols_b      float64
cols_c      float64
cols_d      float64
cols_e      float64
cols_f      float64
cols_str     object
dtype: object

In [20]:
#Convert an object column to numeric; values that cannot be parsed become NaN (errors='coerce') and are then filled with 0
pd.to_numeric(df.cols_str,errors='coerce').fillna(0)

0    0.0
1    0.0
2    0.0
3    0.0
Name: cols_str, dtype: float64

In [21]:
df.apply(pd.to_numeric,errors='coerce').fillna(0)

Unnamed: 0,cols_a,cols_b,cols_c,cols_d,cols_e,cols_f,cols_str
0,-0.557178,0.623962,0.318422,0.844767,2.274763,1.591667,0.0
1,-0.057118,0.488344,-0.480903,0.121318,0.542104,-0.121787,0.0
2,-0.154115,0.933997,0.076008,0.56533,-0.165408,-1.053167,0.0
3,2.372329,0.764797,-0.343273,1.65535,0.703059,-1.358635,0.0
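
The effect of errors='coerce' is easier to see on a column that mixes valid and invalid entries. A small sketch with made-up input (the `raw` Series is an assumption for illustration):

```python
import pandas as pd

# Hypothetical raw column: two parsable strings, one bad entry, one missing.
raw = pd.Series(["1", "2.5", "abc", None])

numeric = pd.to_numeric(raw, errors="coerce")  # unparsable entries become NaN
filled = numeric.fillna(0)                     # then replace NaN with 0

print(numeric.isna().tolist())
print(filled.tolist())
```

Filling with 0 hides the difference between a genuine zero and an unparsable entry, so it is worth checking numeric.isna().sum() before filling.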


In [22]:
#Report the DataFrame's total memory usage, including the contents of object columns
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   cols_a    4 non-null      float64
 1   cols_b    4 non-null      object 
 2   cols_c    4 non-null      float64
 3   cols_d    4 non-null      float64
 4   cols_e    4 non-null      float64
 5   cols_f    4 non-null      float64
 6   cols_str  4 non-null      object 
dtypes: float64(5), object(2)
memory usage: 844.0 bytes


In [23]:
#Count the NA values in each column
df.isna().sum()

cols_a      0
cols_b      0
cols_c      0
cols_d      0
cols_e      0
cols_f      0
cols_str    0
dtype: int64

In [24]:
df=pd.DataFrame({'name':['Nitesh Pandey','Soumya Tiwari','Sarvesh Pandey']})
df

Unnamed: 0,name
0,Nitesh Pandey
1,Soumya Tiwari
2,Sarvesh Pandey


In [25]:
df.name.str.split(' ')

0     [Nitesh, Pandey]
1     [Soumya, Tiwari]
2    [Sarvesh, Pandey]
Name: name, dtype: object

In [26]:
#Split a string into multiple columns
df.name.str.split(' ',expand=True)

Unnamed: 0,0,1
0,Nitesh,Pandey
1,Soumya,Tiwari
2,Sarvesh,Pandey
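
The expanded result can be assigned straight to new, named columns. A minimal sketch, where the frame name `people` and the column names "first" and "last" are assumptions chosen for this example:

```python
import pandas as pd

people = pd.DataFrame({"name": ["Nitesh Pandey", "Soumya Tiwari", "Sarvesh Pandey"]})

# n=1 splits only on the first space, so the result always has two columns.
people[["first", "last"]] = people["name"].str.split(" ", n=1, expand=True)

print(people)
```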


In [27]:
#Expand a Series of lists into a DataFrame
df=pd.DataFrame({'cols_one':['a','b','c'],'cols_two':[[10,20],[30,40],[50,60]]})
df

Unnamed: 0,cols_one,cols_two
0,a,"[10, 20]"
1,b,"[30, 40]"
2,c,"[50, 60]"


In [28]:
df.cols_two

0    [10, 20]
1    [30, 40]
2    [50, 60]
Name: cols_two, dtype: object

In [29]:
df.cols_two.apply(pd.Series)

Unnamed: 0,0,1
0,10,20
1,30,40
2,50,60
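
Two alternatives to apply(pd.Series) are sketched below: building a DataFrame directly from the lists, which gives the same wide layout, and explode, which gives one row per list element instead. The names `expanded` and `long_form` are assumptions for this example:

```python
import pandas as pd

df = pd.DataFrame({"cols_one": ["a", "b", "c"],
                   "cols_two": [[10, 20], [30, 40], [50, 60]]})

# Wide layout: one column per list position.
expanded = pd.DataFrame(df["cols_two"].tolist(), index=df.index)

# Long layout: one row per list element; the index label is repeated.
long_form = df.explode("cols_two")

print(expanded)
print(long_form)
```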


In [30]:
temp=df.cols_two
print(type(temp))

<class 'pandas.core.series.Series'>


In [31]:
#To read the elements of the Series one by one
for i in df.cols_two:
  for j in i:
    print(j)


10
20
30
40
50
60


#### **Data Operations in Pandas**
* Filtering -->query()
* Sorting   -->sort_values()
* Grouping  -->groupby()
* Aggregating -->mean(),median(),std()
* Merging and Joining -->merge() and join()
* Reshaping -->pivot(), melt(), stack(), unstack()
* apply -->apply()
