**Importing Pandas Library**

In [1]:
import pandas as pd

**Checking Pandas Version**

In [2]:
pd.__version__

'1.3.5'

**Pandas Series**

*A Series is a one-dimensional labeled array that can hold data of any type*

In [3]:
x = [3,6,9,12,15,18,21]
ser = pd.Series(x)
ser

0     3
1     6
2     9
3    12
4    15
5    18
6    21
dtype: int64

**If nothing else is specified, the values are labeled with their index number**

**First value has index 0, second value has index 1 and so on**

**The labels can be used to access a specified value**

In [4]:
ser[0]

3

**Creating Labels**

In [5]:
x = [3,6,9,12,15,18,21]
ser = pd.Series(x,index = ["A","B","C","D","E","F","G"])
ser

A     3
B     6
C     9
D    12
E    15
F    18
G    21
dtype: int64

**Using Name Attribute**

In [6]:
x = [3,6,9,12,15,18,21]
ser = pd.Series(x,index = ["A","B","C","D","E","F","G"],name = "Multiple's of 3")
ser

A     3
B     6
C     9
D    12
E    15
F    18
G    21
Name: Multiple's of 3, dtype: int64

**When you have created labels, you can access an item by referring to the label**

In [7]:
ser["A"]

3
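
*Once a Series has custom labels, plain square brackets can be ambiguous about whether a key is a label or a position. The sketch below uses a hypothetical three-item Series (the values and the name "multiples" are illustrative only) to show the explicit alternatives: loc for labels and iloc for positions.*

```python
import pandas as pd

# Hypothetical three-item version of the labeled Series above
ser = pd.Series([3, 6, 9], index=["A", "B", "C"], name="multiples")

by_label = ser.loc["A"]     # explicit label-based access
by_position = ser.iloc[0]   # explicit position-based access

print(by_label, by_position)
```

*Both expressions return the same item here because "A" is the label of the first position.*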

**Key/Value Pairs as Series**

*The keys of the dictionary become the labels*

In [8]:
cal = {"day1": 420, "day2": 380, "day3": 390,"day4":410,"day5":320}
cal_ser = pd.Series(cal)
cal_ser

day1    420
day2    380
day3    390
day4    410
day5    320
dtype: int64

**Access Item by using Key of the Dictionary**

In [9]:
cal_ser["day1"]

420

**Pandas DataFrames**

*A DataFrame is a two-dimensional labeled structure that holds data in rows and columns; each column can have its own type*

In [10]:
cal = {"calories":[420,380,390,410,320],
      "duration":[50,40,45,47,49]}
cal_df = pd.DataFrame(cal)
cal_df

   calories  duration
0       420        50
1       380        40
2       390        45
3       410        47
4       320        49


**Index of Data Frame**

In [11]:
cal_df.index

RangeIndex(start=0, stop=5, step=1)

**Locate Row**

*Pandas uses the loc attribute to return one or more specified rows*

In [12]:
cal_df.loc[0] # This attribute returns a Pandas Series

calories    420
duration     50
Name: 0, dtype: int64

**Access Multiple Rows**

*When a list of labels is passed inside [ ], the result is a Pandas DataFrame*

In [13]:
cal_df.loc[[0,1]]

   calories  duration
0       420        50
1       380        40
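
*The difference between a single label and a list of labels can be checked directly. This is a minimal sketch on a shortened, hypothetical copy of the calories data; the variable names are illustrative.*

```python
import pandas as pd

# Shortened, hypothetical copy of the calories data
cal_df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

row_as_series = cal_df.loc[0]     # single label -> Series
row_as_frame = cal_df.loc[[0]]    # list with one label -> DataFrame
two_rows = cal_df.loc[[0, 1]]     # list with two labels -> DataFrame

print(type(row_as_series).__name__)
print(type(row_as_frame).__name__)
print(two_rows.shape)
```

*Even a one-element list keeps the DataFrame shape, which is useful when later code expects columns.*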


**Named Indexes**

*With the index argument, you can name your own indexes*

In [14]:
cal = {"calories":[420,380,390,410,320],
      "duration":[50,40,45,47,49]}
cal_df = pd.DataFrame(cal,index = ["day1","day2","day3","day4","day5"])
cal_df

      calories  duration
day1       420        50
day2       380        40
day3       390        45
day4       410        47
day5       320        49


**Locate Named Indexes**

*Use the named index in the loc attribute to return the specified row(s)*

In [15]:
cal_df.loc["day1"]

calories    420
duration     50
Name: day1, dtype: int64

**Read CSV File**

In [16]:
df = pd.read_csv("/kaggle/input/canada-per-capita-income/canada_per_capita_income.csv")
df

   year  per capita income (US$)
0  1970              3399.299037
1  1971              3768.297935
2  1972              4251.175484
3  1973              4804.463248
4  1974              5576.514583
5  1975              5998.144346
6  1976              7062.131392
7  1977              7100.126170
8  1978              7247.967035
9  1979              7602.912681


**Changing the Index Column**

In [17]:
df = pd.read_csv("/kaggle/input/canada-per-capita-income/canada_per_capita_income.csv",index_col = "year")
df

      per capita income (US$)
year
1970              3399.299037
1971              3768.297935
1972              4251.175484
1973              4804.463248
1974              5576.514583
1975              5998.144346
1976              7062.131392
1977              7100.126170
1978              7247.967035
1979              7602.912681


**Manipulating the Index**

In [18]:
df = pd.read_csv("/kaggle/input/canada-per-capita-income/canada_per_capita_income.csv")
df.set_index("year") # This method returns a new DataFrame; df itself is unchanged

      per capita income (US$)
year
1970              3399.299037
1971              3768.297935
1972              4251.175484
1973              4804.463248
1974              5576.514583
1975              5998.144346
1976              7062.131392
1977              7100.126170
1978              7247.967035
1979              7602.912681
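
*Because set_index() returns a new DataFrame, the result has to be assigned to a name if it is needed later. The sketch below uses a small hypothetical stand-in for the CSV (three rows typed in by hand) rather than the Kaggle file.*

```python
import pandas as pd

# Hypothetical three-row stand-in for the income CSV
income = pd.DataFrame({
    "year": [1970, 1971, 1972],
    "per capita income (US$)": [3399.299037, 3768.297935, 4251.175484],
})

indexed = income.set_index("year")   # new DataFrame with "year" as the index

print(income.columns.tolist())       # the original still has "year" as a column
print(indexed.index.name)
print(indexed.loc[1971, "per capita income (US$)"])
```

*With the year as the index, loc can look up a row by year directly.*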


**Use to_string() to print the entire DataFrame**

In [19]:
print(df.to_string())

    year  per capita income (US$)
0   1970              3399.299037
1   1971              3768.297935
2   1972              4251.175484
3   1973              4804.463248
4   1974              5576.514583
5   1975              5998.144346
6   1976              7062.131392
7   1977              7100.126170
8   1978              7247.967035
9   1979              7602.912681
10  1980              8355.968120
11  1981              9434.390652
12  1982              9619.438377
13  1983             10416.536590
14  1984             10790.328720
15  1985             11018.955850
16  1986             11482.891530
17  1987             12974.806620
18  1988             15080.283450
19  1989             16426.725480
20  1990             16838.673200
21  1991             17266.097690
22  1992             16412.083090
23  1993             15875.586730
24  1994             15755.820270
25  1995             16369.317250
26  1996             16699.826680
27  1997             17310.757750
28  1998      

**Conditional Selection**

In [20]:
df.year == 1970

0      True
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30    False
31    False
32    False
33    False
34    False
35    False
36    False
37    False
38    False
39    False
40    False
41    False
42    False
43    False
44    False
45    False
46    False
Name: year, dtype: bool

**max_rows**

*The maximum number of rows displayed is defined in the Pandas option settings*

In [21]:
pd.options.display.max_rows

60

**Increase the maximum number of rows to display the entire DataFrame**

In [22]:
pd.options.display.max_rows = 999
pd.options.display.max_rows

999

**min_rows**

*When a DataFrame has more rows than max_rows, min_rows sets how many rows are shown in the truncated display*

In [23]:
pd.options.display.min_rows

10

**Increase min_rows to show more rows in the truncated display**

In [24]:
pd.options.display.min_rows = 50
pd.options.display.min_rows

50
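
*Assigning to pd.options changes the setting for the rest of the session. As a hedged alternative, pd.option_context() changes an option only inside a with-block. The 100-row frame below is a hypothetical example, not the income data.*

```python
import pandas as pd

numbers = pd.DataFrame({"n": range(100)})   # hypothetical 100-row frame
default_max = pd.get_option("display.max_rows")

# The option is changed only inside the with-block
with pd.option_context("display.max_rows", 999):
    inside_max = pd.get_option("display.max_rows")
    full_text = repr(numbers)               # rendered without truncation

after_max = pd.get_option("display.max_rows")
print(inside_max, after_max == default_max)
```

*After the block ends, the previous value of display.max_rows is restored automatically.*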

**Viewing the Data**

*By default, the head() method returns the headers and top 5 rows*

In [25]:
df = pd.read_csv("/kaggle/input/canada-per-capita-income/canada_per_capita_income.csv")
df.head()

   year  per capita income (US$)
0  1970              3399.299037
1  1971              3768.297935
2  1972              4251.175484
3  1973              4804.463248
4  1974              5576.514583


**Viewing the top 10 Rows**

In [26]:
df.head(10)

   year  per capita income (US$)
0  1970              3399.299037
1  1971              3768.297935
2  1972              4251.175484
3  1973              4804.463248
4  1974              5576.514583
5  1975              5998.144346
6  1976              7062.131392
7  1977              7100.126170
8  1978              7247.967035
9  1979              7602.912681


**Viewing the Data**

*By default, the tail() method returns the headers and last 5 rows*

In [27]:
df = pd.read_csv("/kaggle/input/canada-per-capita-income/canada_per_capita_income.csv")
df.tail()

    year  per capita income (US$)
42  2012             42665.255970
43  2013             42676.468370
44  2014             41039.893600
45  2015             35175.188980
46  2016             34229.193630


**Viewing the last 10 Rows**

In [28]:
df.tail(10)

    year  per capita income (US$)
37  2007             36144.481220
38  2008             37446.486090
39  2009             32755.176820
40  2010             38420.522890
41  2011             42334.711210
42  2012             42665.255970
43  2013             42676.468370
44  2014             41039.893600
45  2015             35175.188980
46  2016             34229.193630


**Info About the Data**

*The DataFrame object has a method called info() that gives you more information about the data set*

In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47 entries, 0 to 46
Data columns (total 2 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   year                     47 non-null     int64  
 1   per capita income (US$)  47 non-null     float64
dtypes: float64(1), int64(1)
memory usage: 880.0 bytes


**Data Cleaning**

*Data cleaning means fixing bad data in your data set*

*Bad data could be:*

*-> Empty cells*

*-> Data in wrong format*

*-> Wrong data*

*-> Duplicates*

**Pandas - Cleaning Empty Cells**

*Empty cells can potentially give you a wrong result when you analyze data*

In [30]:
df = pd.read_csv("/kaggle/input/pandas-practice-dataset/data.csv")
df

   Duration          Date  Pulse  Maxpulse  Calories
0        60  '2020/12/01'    110       130     409.1
1        60  '2020/12/02'    117       145     479.0
2        60  '2020/12/03'    103       135     340.0
3        45  '2020/12/04'    109       175     282.4
4        45  '2020/12/05'    117       148     406.0
5        60  '2020/12/06'    102       127     300.0
6        60  '2020/12/07'    110       136     374.0
7       450  '2020/12/08'    104       134     253.3
8        30  '2020/12/09'    109       133     195.1
9        60  '2020/12/10'     98       124     269.0


**Replace Empty Values**

*A way of dealing with empty cells is to insert a new value instead*

*The fillna() method allows us to replace empty cells with a value*

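*By default, fillna() returns a new DataFrame and does not change the original, which is why the cell below assigns the result back to df*

*If you want to change the original DataFrame directly, use the inplace = True argument*

*fillna(inplace = True) does NOT return a new DataFrame; it fills the NULL values in the original DataFrame*

*Note that df.fillna(130) fills the empty cells in every column, including Date*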

In [31]:
df = df.fillna(130)
df

   Duration          Date  Pulse  Maxpulse  Calories
0        60  '2020/12/01'    110       130     409.1
1        60  '2020/12/02'    117       145     479.0
2        60  '2020/12/03'    103       135     340.0
3        45  '2020/12/04'    109       175     282.4
4        45  '2020/12/05'    117       148     406.0
5        60  '2020/12/06'    102       127     300.0
6        60  '2020/12/07'    110       136     374.0
7       450  '2020/12/08'    104       134     253.3
8        30  '2020/12/09'    109       133     195.1
9        60  '2020/12/10'     98       124     269.0


**Using Pandas replace() Method**

*The replace() method returns a new Series in which the value 450 in the Duration column is replaced with 45; the original DataFrame is not changed*

In [32]:
df.Duration.replace(450,45)

0     60
1     60
2     60
3     45
4     45
5     60
6     60
7     45
8     30
9     60
10    60
11    60
12    60
13    60
14    60
15    60
16    60
17    60
18    45
19    60
20    45
21    60
22    45
23    60
24    45
25    60
26    60
27    60
28    60
29    60
30    60
31    60
Name: Duration, dtype: int64

**Replace Only For Specified Columns**

*With this syntax you can fill the NaN values of a specific column; the result is a new Series, and the original DataFrame is unchanged*

In [33]:
df = pd.read_csv("/kaggle/input/pandas-practice-dataset/data.csv")
new_df = df["Calories"].fillna(130)
new_df

0     409.1
1     479.0
2     340.0
3     282.4
4     406.0
5     300.0
6     374.0
7     253.3
8     195.1
9     269.0
10    329.3
11    250.7
12    250.7
13    345.3
14    379.3
15    275.0
16    215.2
17    300.0
18    130.0
19    323.0
20    243.0
21    364.2
22    282.0
23    300.0
24    246.0
25    334.5
26    250.0
27    241.0
28    130.0
29    280.0
30    380.3
31    243.0
Name: Calories, dtype: float64
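
*Selecting one column and calling fillna() gives back a Series. If the whole DataFrame should be kept while only one column is filled, one option is to pass a dictionary that maps column names to fill values. This is a sketch on a small hypothetical frame; the column values are made up for illustration.*

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column
workouts = pd.DataFrame({
    "Pulse": [110.0, np.nan, 103.0],
    "Calories": [409.1, np.nan, 340.0],
})

filled_all = workouts.fillna(130)                # fills every column
filled_cal = workouts.fillna({"Calories": 130})  # fills only "Calories"

print(filled_all)
print(filled_cal)
print(workouts)   # the original still contains NaN
```

*Both calls return new DataFrames, so workouts itself keeps its empty cells unless the result is assigned back.*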

**The dropna() method returns a new DataFrame with no empty cells**

*By default, the dropna() method returns a new DataFrame, and will not change the original*

*If you want to change the original DataFrame, use the inplace = True argument*

*The dropna(inplace = True) will NOT return a new DataFrame, but it will remove all rows containing NULL values from the original DataFrame*

In [34]:
df = pd.read_csv("/kaggle/input/pandas-practice-dataset/data.csv")
new_df = df.dropna()
new_df

   Duration          Date  Pulse  Maxpulse  Calories
0        60  '2020/12/01'    110       130     409.1
1        60  '2020/12/02'    117       145     479.0
2        60  '2020/12/03'    103       135     340.0
3        45  '2020/12/04'    109       175     282.4
4        45  '2020/12/05'    117       148     406.0
5        60  '2020/12/06'    102       127     300.0
6        60  '2020/12/07'    110       136     374.0
7       450  '2020/12/08'    104       134     253.3
8        30  '2020/12/09'    109       133     195.1
9        60  '2020/12/10'     98       124     269.0


**Delete from Specific Column**

*With this syntax you get the values of a specific column with the NaN entries removed; the result is a Series, and no rows are deleted from the DataFrame*

In [35]:
df = pd.read_csv("/kaggle/input/pandas-practice-dataset/data.csv")
new_df = df["Calories"].dropna()
new_df

0     409.1
1     479.0
2     340.0
3     282.4
4     406.0
5     300.0
6     374.0
7     253.3
8     195.1
9     269.0
10    329.3
11    250.7
12    250.7
13    345.3
14    379.3
15    275.0
16    215.2
17    300.0
19    323.0
20    243.0
21    364.2
22    282.0
23    300.0
24    246.0
25    334.5
26    250.0
27    241.0
29    280.0
30    380.3
31    243.0
Name: Calories, dtype: float64
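
*To drop whole rows based on empty cells in one column, dropna() accepts a subset argument. The sketch below contrasts the two approaches on a small hypothetical frame.*

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing Calories value
workouts = pd.DataFrame({
    "Duration": [60, 45, 60],
    "Calories": [409.1, np.nan, 340.0],
})

calories_only = workouts["Calories"].dropna()      # Series without the NaN entry
rows_kept = workouts.dropna(subset=["Calories"])   # DataFrame without that row

print(calories_only)
print(rows_kept)
```

*The second form keeps the other columns aligned with the remaining rows; the original frame is still unchanged in both cases.*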

**Replace Using Mean, Median, or Mode**

*Pandas provides the mean(), median(), and mode() methods to calculate the respective values for a specified column; empty cells are skipped in the calculation*

In [36]:
df = pd.read_csv("/kaggle/input/pandas-practice-dataset/data.csv")
df_cal = df["Calories"]
df_cal

0     409.1
1     479.0
2     340.0
3     282.4
4     406.0
5     300.0
6     374.0
7     253.3
8     195.1
9     269.0
10    329.3
11    250.7
12    250.7
13    345.3
14    379.3
15    275.0
16    215.2
17    300.0
18      NaN
19    323.0
20    243.0
21    364.2
22    282.0
23    300.0
24    246.0
25    334.5
26    250.0
27    241.0
28      NaN
29    280.0
30    380.3
31    243.0
Name: Calories, dtype: float64

**Mean = the average value (the sum of all values divided by number of values)**

In [37]:
df = pd.read_csv("/kaggle/input/pandas-practice-dataset/data.csv")
df_mean = df["Calories"].mean()
df["Calories"].fillna(df_mean)

0     409.10
1     479.00
2     340.00
3     282.40
4     406.00
5     300.00
6     374.00
7     253.30
8     195.10
9     269.00
10    329.30
11    250.70
12    250.70
13    345.30
14    379.30
15    275.00
16    215.20
17    300.00
18    304.68
19    323.00
20    243.00
21    364.20
22    282.00
23    300.00
24    246.00
25    334.50
26    250.00
27    241.00
28    304.68
29    280.00
30    380.30
31    243.00
Name: Calories, dtype: float64

**Median = the value in the middle, after you have sorted all values ascending**

In [38]:
df = pd.read_csv("/kaggle/input/pandas-practice-dataset/data.csv")
df_median = df["Calories"].median()
df["Calories"].fillna(df_median)

0     409.1
1     479.0
2     340.0
3     282.4
4     406.0
5     300.0
6     374.0
7     253.3
8     195.1
9     269.0
10    329.3
11    250.7
12    250.7
13    345.3
14    379.3
15    275.0
16    215.2
17    300.0
18    291.2
19    323.0
20    243.0
21    364.2
22    282.0
23    300.0
24    246.0
25    334.5
26    250.0
27    241.0
28    291.2
29    280.0
30    380.3
31    243.0
Name: Calories, dtype: float64

**Mode = the value that appears most frequently**

In [39]:
df = pd.read_csv("/kaggle/input/pandas-practice-dataset/data.csv")
df_mode = df["Calories"].mode()[0] # mode() returns a Series because several values can tie; [0] takes the first one
df["Calories"].fillna(df_mode)

0     409.1
1     479.0
2     340.0
3     282.4
4     406.0
5     300.0
6     374.0
7     253.3
8     195.1
9     269.0
10    329.3
11    250.7
12    250.7
13    345.3
14    379.3
15    275.0
16    215.2
17    300.0
18    300.0
19    323.0
20    243.0
21    364.2
22    282.0
23    300.0
24    246.0
25    334.5
26    250.0
27    241.0
28    300.0
29    280.0
30    380.3
31    243.0
Name: Calories, dtype: float64

**Pandas - Cleaning Data of Wrong Format**

*Convert Into a Correct Format*

*In our DataFrame, two cells in the 'Date' column need attention: row 22 is empty (NaN) and row 26 has the wrong format. The 'Date' column should contain strings that represent dates*

In [40]:
df = pd.read_csv("/kaggle/input/pandas-practice-dataset/data.csv")
df["Date"]

0     '2020/12/01'
1     '2020/12/02'
2     '2020/12/03'
3     '2020/12/04'
4     '2020/12/05'
5     '2020/12/06'
6     '2020/12/07'
7     '2020/12/08'
8     '2020/12/09'
9     '2020/12/10'
10    '2020/12/11'
11    '2020/12/12'
12    '2020/12/12'
13    '2020/12/13'
14    '2020/12/14'
15    '2020/12/15'
16    '2020/12/16'
17    '2020/12/17'
18    '2020/12/18'
19    '2020/12/19'
20    '2020/12/20'
21    '2020/12/21'
22             NaN
23    '2020/12/23'
24    '2020/12/24'
25    '2020/12/25'
26        20201226
27    '2020/12/27'
28    '2020/12/28'
29    '2020/12/29'
30    '2020/12/30'
31    '2020/12/31'
Name: Date, dtype: object

**Pandas has a to_datetime() function for converting dates to the right format**

*The empty cell in row 22 is converted to NaT (Not a Time), which marks a missing date*

In [41]:
df = pd.read_csv("/kaggle/input/pandas-practice-dataset/data.csv")
df["Date"] = pd.to_datetime(df["Date"])
df["Date"]

0    2020-12-01
1    2020-12-02
2    2020-12-03
3    2020-12-04
4    2020-12-05
5    2020-12-06
6    2020-12-07
7    2020-12-08
8    2020-12-09
9    2020-12-10
10   2020-12-11
11   2020-12-12
12   2020-12-12
13   2020-12-13
14   2020-12-14
15   2020-12-15
16   2020-12-16
17   2020-12-17
18   2020-12-18
19   2020-12-19
20   2020-12-20
21   2020-12-21
22          NaT
23   2020-12-23
24   2020-12-24
25   2020-12-25
26   2020-12-26
27   2020-12-27
28   2020-12-28
29   2020-12-29
30   2020-12-30
31   2020-12-31
Name: Date, dtype: datetime64[ns]
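
*How strictly to_datetime() treats mixed or invalid inputs can differ between Pandas versions. A more explicit variant, sketched here on a hypothetical three-item Series, states the expected format and uses errors="coerce" so that anything that does not match becomes NaT instead of raising an error. Unlike the notebook's data, the sample strings have no surrounding quote characters.*

```python
import pandas as pd

# Hypothetical inputs: one valid date, one empty cell, one invalid string
raw = pd.Series(["2020/12/01", None, "not a date"])

# Values that do not match the format are turned into NaT
parsed = pd.to_datetime(raw, format="%Y/%m/%d", errors="coerce")

print(parsed)
print(parsed.isna().tolist())
```

*Rows that ended up as NaT can then be inspected or removed, for example with dropna().*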

**Pandas - Fixing Wrong Data**

*How can we fix wrong values, like the one for "Duration" in row 7?*

In [42]:
df = pd.read_csv("/kaggle/input/pandas-practice-dataset/data.csv")
df["Duration"]

0      60
1      60
2      60
3      45
4      45
5      60
6      60
7     450
8      30
9      60
10     60
11     60
12     60
13     60
14     60
15     60
16     60
17     60
18     45
19     60
20     45
21     60
22     45
23     60
24     45
25     60
26     60
27     60
28     60
29     60
30     60
31     60
Name: Duration, dtype: int64

**Replacing Values**

In [43]:
df.loc[7,"Duration"] = 45
df["Duration"]

0     60
1     60
2     60
3     45
4     45
5     60
6     60
7     45
8     30
9     60
10    60
11    60
12    60
13    60
14    60
15    60
16    60
17    60
18    45
19    60
20    45
21    60
22    45
23    60
24    45
25    60
26    60
27    60
28    60
29    60
30    60
31    60
Name: Duration, dtype: int64

*To replace wrong data for larger data sets you can create some rules, e.g. set some boundaries for legal values, and replace any values that are outside of the boundaries*

In [44]:
df = pd.read_csv("/kaggle/input/pandas-practice-dataset/data.csv")

for x in df.index:
    if df.loc[x,"Duration"] > 60:
        df.loc[x,"Duration"] = 60
        
df["Duration"]

0     60
1     60
2     60
3     45
4     45
5     60
6     60
7     60
8     30
9     60
10    60
11    60
12    60
13    60
14    60
15    60
16    60
17    60
18    45
19    60
20    45
21    60
22    45
23    60
24    45
25    60
26    60
27    60
28    60
29    60
30    60
31    60
Name: Duration, dtype: int64
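
*The loop above visits every row one by one. The same boundary rule can also be written without a loop; the sketch below shows two possible forms on a hypothetical four-row frame: a Boolean mask with loc, and the clip() method.*

```python
import pandas as pd

# Hypothetical durations, including one value above the limit of 60
durations = pd.DataFrame({"Duration": [60, 45, 450, 30]})

# Option 1: assign through a Boolean mask
capped = durations.copy()
capped.loc[capped["Duration"] > 60, "Duration"] = 60

# Option 2: clip() returns a new Series limited to the upper bound
clipped = durations["Duration"].clip(upper=60)

print(capped["Duration"].tolist())
print(clipped.tolist())
```

*Both forms give the same values as the loop; which one reads better is a matter of taste.*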

**Pandas - Removing Duplicates**

*Duplicate rows are rows that have been registered more than one time*

*To discover duplicates, we can use the duplicated() method*

*The duplicated() method returns a Boolean value for each row*

In [45]:
df = pd.read_csv("/kaggle/input/pandas-practice-dataset/data.csv")
df.duplicated()

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12     True
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30    False
31    False
dtype: bool

**Removing Duplicates**

*To remove duplicates, use the drop_duplicates() method*

*By default, drop_duplicates() returns a new DataFrame and leaves the original unchanged, as in the cell below. With inplace = True, the method does NOT return a new DataFrame; it removes all duplicates from the original DataFrame*

In [46]:
df.drop_duplicates()

   Duration          Date  Pulse  Maxpulse  Calories
0        60  '2020/12/01'    110       130     409.1
1        60  '2020/12/02'    117       145     479.0
2        60  '2020/12/03'    103       135     340.0
3        45  '2020/12/04'    109       175     282.4
4        45  '2020/12/05'    117       148     406.0
5        60  '2020/12/06'    102       127     300.0
6        60  '2020/12/07'    110       136     374.0
7       450  '2020/12/08'    104       134     253.3
8        30  '2020/12/09'    109       133     195.1
9        60  '2020/12/10'     98       124     269.0
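
*A short check on a hypothetical three-row frame shows what duplicated() flags and that drop_duplicates() leaves the original untouched unless the result is assigned.*

```python
import pandas as pd

# Hypothetical frame in which the second row repeats the first
sessions = pd.DataFrame({"Duration": [60, 60, 45], "Pulse": [100, 100, 109]})

flags = sessions.duplicated()          # True for the repeated row only
deduped = sessions.drop_duplicates()   # new DataFrame without the repeat

print(flags.tolist())
print(len(sessions), len(deduped))
```

*The first occurrence of a repeated row is kept, so the remaining index labels are 0 and 2.*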


**Pandas - Data Correlations**

*Finding Relationships*

*The corr() method calculates the relationship between each pair of columns in your data set*

*The corr() method ignores non-numeric columns in the Pandas version used here (1.3.5); from Pandas 2.0 onward they are no longer dropped silently, so exclude them first, for example with numeric_only = True*

**The result of the corr() method is a table of numbers that represent how strong the relationship is between two columns**

**The number varies from -1 to 1**

*1 means that there is a 1 to 1 relationship (a perfect correlation): each time a value went up in the first column, the other one went up as well*

*-1 is an equally strong relationship in the opposite direction, and values near 0 mean there is little or no linear relationship*

**Perfect Correlation**

*We can see that "Duration" and "Duration" got the number 1.0, which makes sense: each column always has a perfect relationship with itself*

**Good Correlation**

*"Pulse" and "Calories" got a 0.513186 correlation, the strongest relationship between two different columns in this output. It is a moderate positive correlation: workouts with a higher pulse tend to burn more calories, but the relationship is far from perfect*

**Bad Correlation**

*"Duration" and "Pulse" got a 0.00441 correlation, which is a very bad correlation, meaning that we cannot predict the pulse by just looking at the duration of the workout, and vice versa*

*"Duration" and "Calories" got only -0.114169. Note that this output comes from the uncleaned data, which still contains the wrong value 450 in row 7, and a single extreme value like that can strongly affect a correlation*

In [47]:
df.corr()

          Duration     Pulse  Maxpulse  Calories
Duration  1.000000  0.004410  0.049959 -0.114169
Pulse     0.004410  1.000000  0.276583  0.513186
Maxpulse  0.049959  0.276583  1.000000  0.357460
Calories -0.114169  0.513186  0.357460  1.000000
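
*One way to make the handling of non-numeric columns explicit, whatever the Pandas version, is to select the numeric columns before calling corr(). The frame below is hypothetical and built so that the expected correlations are obvious: y always rises with x, and z always falls.*

```python
import pandas as pd

# Hypothetical frame: "label" is text, the other columns are numeric
frame = pd.DataFrame({
    "label": ["a", "b", "c", "d"],
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],   # rises with x
    "z": [4.0, 3.0, 2.0, 1.0],   # falls as x rises
})

# Keep only numeric columns, then compute pairwise correlations
corr = frame.select_dtypes(include="number").corr()

print(corr)
```

*Here x and y correlate at 1 and x and z at -1, matching the two ends of the scale described above.*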


**loc[ ] and iloc[ ] Indexers**

In [48]:
df = pd.read_csv("/kaggle/input/pandas-practice-dataset/data.csv")
df

   Duration          Date  Pulse  Maxpulse  Calories
0        60  '2020/12/01'    110       130     409.1
1        60  '2020/12/02'    117       145     479.0
2        60  '2020/12/03'    103       135     340.0
3        45  '2020/12/04'    109       175     282.4
4        45  '2020/12/05'    117       148     406.0
5        60  '2020/12/06'    102       127     300.0
6        60  '2020/12/07'    110       136     374.0
7       450  '2020/12/08'    104       134     253.3
8        30  '2020/12/09'    109       133     195.1
9        60  '2020/12/10'     98       124     269.0


**Index-based Selection**

*Selecting the first row of the DataFrame*

In [49]:
df.iloc[0]

Duration              60
Date        '2020/12/01'
Pulse                110
Maxpulse             130
Calories           409.1
Name: 0, dtype: object

**Index Selection Using List**

*Selecting the rows at positions 1, 3, and 5 of the column at position 1 ("Date")*

In [50]:
df.iloc[[1,3,5],1]

1    '2020/12/02'
3    '2020/12/04'
5    '2020/12/06'
Name: Date, dtype: object

**Label-based Selection**

In [51]:
df.loc[0,"Calories"]

409.1
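
*With the default index 0, 1, 2, ... positions and labels coincide, so iloc and loc look alike. The difference shows up when the labels are not the positions, as in this hypothetical frame with labels 10, 20, and 30.*

```python
import pandas as pd

# Hypothetical frame whose row labels differ from the row positions
frame = pd.DataFrame({"Calories": [409.1, 479.0, 340.0]}, index=[10, 20, 30])

first_by_position = frame.iloc[0, 0]         # row position 0, column position 0
first_by_label = frame.loc[10, "Calories"]   # row label 10, column label "Calories"

position_slice = frame.iloc[0:2]   # positions 0 and 1; the end is excluded
label_slice = frame.loc[10:30]     # labels 10 through 30; the end is included

print(first_by_position, first_by_label)
print(len(position_slice), len(label_slice))
```

*Note the slicing difference: iloc follows Python's end-exclusive convention, while loc includes the end label.*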

**Label-based Selection with Condition**

In [52]:
df.loc[df.Duration == 60]

    Duration          Date  Pulse  Maxpulse  Calories
0         60  '2020/12/01'    110       130     409.1
1         60  '2020/12/02'    117       145     479.0
2         60  '2020/12/03'    103       135     340.0
5         60  '2020/12/06'    102       127     300.0
6         60  '2020/12/07'    110       136     374.0
9         60  '2020/12/10'     98       124     269.0
10        60  '2020/12/11'    103       147     329.3
11        60  '2020/12/12'    100       120     250.7
12        60  '2020/12/12'    100       120     250.7
13        60  '2020/12/13'    106       128     345.3


**Label-based Selection with Multiple Conditions Using the "and" Operator (&)**

In [53]:
df.loc[(df.Duration == 60) & (df.Pulse == 110)]

   Duration          Date  Pulse  Maxpulse  Calories
0        60  '2020/12/01'    110       130     409.1
6        60  '2020/12/07'    110       136     374.0


**Label-based Selection with Multiple Conditions Using the "or" Operator (|)**

In [54]:
df.loc[(df.Duration == 60) | (df.Pulse == 110)]

    Duration          Date  Pulse  Maxpulse  Calories
0         60  '2020/12/01'    110       130     409.1
1         60  '2020/12/02'    117       145     479.0
2         60  '2020/12/03'    103       135     340.0
5         60  '2020/12/06'    102       127     300.0
6         60  '2020/12/07'    110       136     374.0
9         60  '2020/12/10'     98       124     269.0
10        60  '2020/12/11'    103       147     329.3
11        60  '2020/12/12'    100       120     250.7
12        60  '2020/12/12'    100       120     250.7
13        60  '2020/12/13'    106       128     345.3
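
*The conditions are combined element by element with & and |, not with Python's and / or keywords, and each comparison needs its own parentheses because & and | bind more tightly than ==. A small sketch on a hypothetical four-row frame, which also shows ~ for negation:*

```python
import pandas as pd

# Hypothetical frame with a mix of durations and pulse values
df = pd.DataFrame({"Duration": [60, 45, 60, 30], "Pulse": [110, 110, 98, 109]})

is_60 = df["Duration"] == 60
is_110 = df["Pulse"] == 110

both = df.loc[is_60 & is_110]         # rows where both conditions hold
either = df.loc[is_60 | is_110]       # rows where at least one holds
neither = df.loc[~(is_60 | is_110)]   # rows where neither holds

print(both.index.tolist(), either.index.tolist(), neither.index.tolist())
```

*Storing each condition in a named variable is optional; it simply avoids repeating the parentheses.*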


**isin() Built-in Conditional Selector**

*isin() lets you select data whose value "is in" a list of values*

In [55]:
df.loc[df.Pulse.isin([100,92])]

    Duration          Date  Pulse  Maxpulse  Calories
11        60  '2020/12/12'    100       120     250.7
12        60  '2020/12/12'    100       120     250.7
17        60  '2020/12/17'    100       120     300.0
22        45           NaN    100       119     282.0
26        60      20201226    100       120     250.0
27        60  '2020/12/27'     92       118     241.0
29        60  '2020/12/29'    100       132     280.0
31        60  '2020/12/31'     92       115     243.0


**notnull() Built-in Conditional Selector**

*The isnull() and notnull() methods let you select values which are (or are not) empty (NaN)*

In [56]:
df.loc[df.Calories.notnull()]

   Duration          Date  Pulse  Maxpulse  Calories
0        60  '2020/12/01'    110       130     409.1
1        60  '2020/12/02'    117       145     479.0
2        60  '2020/12/03'    103       135     340.0
3        45  '2020/12/04'    109       175     282.4
4        45  '2020/12/05'    117       148     406.0
5        60  '2020/12/06'    102       127     300.0
6        60  '2020/12/07'    110       136     374.0
7       450  '2020/12/08'    104       134     253.3
8        30  '2020/12/09'    109       133     195.1
9        60  '2020/12/10'     98       124     269.0


**isnull() Conditional Selector**

*To select NaN entries you can use pd.isnull()*

In [57]:
df[pd.isnull(df.Calories)]

    Duration          Date  Pulse  Maxpulse  Calories
18        45  '2020/12/18'     90       112       NaN
28        60  '2020/12/28'    103       132       NaN


**Summary Functions**

In [58]:
df = pd.read_csv("/kaggle/input/pandas-practice-dataset/data.csv")
df.Pulse.describe() # This method generates a high-level summary of the attributes of the given column

count     32.000000
mean     103.500000
std        7.832933
min       90.000000
25%      100.000000
50%      102.500000
75%      106.500000
max      130.000000
Name: Pulse, dtype: float64

In [59]:
df.Pulse.unique() # To see a list of unique values we can use the unique() function

array([110, 117, 103, 109, 102, 104,  98, 100, 106,  90,  97, 108, 130,
       105,  92])

In [60]:
df.Pulse.value_counts() # To see a list of unique values and how often they occur in the dataset, we can use the value_counts() method

100    6
103    4
102    3
98     3
110    2
117    2
109    2
104    2
92     2
106    1
90     1
97     1
108    1
130    1
105    1
Name: Pulse, dtype: int64

**Renaming**

*The rename() method lets you change index names and/or column names; it returns a new DataFrame and leaves the original unchanged*

In [61]:
df.rename(columns = {"Duration":"Time"})

   Time          Date  Pulse  Maxpulse  Calories
0    60  '2020/12/01'    110       130     409.1
1    60  '2020/12/02'    117       145     479.0
2    60  '2020/12/03'    103       135     340.0
3    45  '2020/12/04'    109       175     282.4
4    45  '2020/12/05'    117       148     406.0
5    60  '2020/12/06'    102       127     300.0
6    60  '2020/12/07'    110       136     374.0
7   450  '2020/12/08'    104       134     253.3
8    30  '2020/12/09'    109       133     195.1
9    60  '2020/12/10'     98       124     269.0


**Sorting**

*To get data in the order we want, we use sort_values() Method*

In [62]:
df.sort_values(by = "Duration")

    Duration          Date  Pulse  Maxpulse  Calories
8         30  '2020/12/09'    109       133     195.1
20        45  '2020/12/20'     97       125     243.0
3         45  '2020/12/04'    109       175     282.4
4         45  '2020/12/05'    117       148     406.0
22        45           NaN    100       119     282.0
18        45  '2020/12/18'     90       112       NaN
24        45  '2020/12/24'    105       132     246.0
19        60  '2020/12/19'    103       123     323.0
21        60  '2020/12/21'    108       131     364.2
23        60  '2020/12/23'    130       101     300.0


**sort_values() defaults to an ascending sort, where the lowest values go first**

*When we want a descending sort instead, where the higher numbers go first, we use sort_values(by = "col_name", ascending = False)*

In [63]:
df.sort_values(by = "Duration",ascending = False)

    Duration          Date  Pulse  Maxpulse  Calories
7        450  '2020/12/08'    104       134     253.3
0         60  '2020/12/01'    110       130     409.1
15        60  '2020/12/15'     98       123     275.0
30        60  '2020/12/30'    102       129     380.3
29        60  '2020/12/29'    100       132     280.0
28        60  '2020/12/28'    103       132       NaN
27        60  '2020/12/27'     92       118     241.0
26        60      20201226    100       120     250.0
25        60  '2020/12/25'    102       126     334.5
23        60  '2020/12/23'    130       101     300.0


In [64]:
df.sort_index() # To sort by index values, use the companion method sort_index()

   Duration          Date  Pulse  Maxpulse  Calories
0        60  '2020/12/01'    110       130     409.1
1        60  '2020/12/02'    117       145     479.0
2        60  '2020/12/03'    103       135     340.0
3        45  '2020/12/04'    109       175     282.4
4        45  '2020/12/05'    117       148     406.0
5        60  '2020/12/06'    102       127     300.0
6        60  '2020/12/07'    110       136     374.0
7       450  '2020/12/08'    104       134     253.3
8        30  '2020/12/09'    109       133     195.1
9        60  '2020/12/10'     98       124     269.0


**Grouping**

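*count() and len give the same result here because the Duration column has no empty values; in general, count() counts only non-null values, while len counts every row. count() gives the output as a Series, and agg([len]) gives a DataFrame because the function is passed inside a list*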

In [65]:
df = pd.read_csv("/kaggle/input/pandas-practice-dataset/data.csv")
df.groupby("Duration").Duration.count() # Grouping the Values of Duration Column by Combining it with Count Function

Duration
30      1
45      6
60     24
450     1
Name: Duration, dtype: int64

In [66]:
type(df.groupby("Duration").Duration.count()) # This returns a Pandas Series

pandas.core.series.Series

In [67]:
df.groupby(["Duration"]).Duration.agg([len]) # Grouping the Values of Duration Column by Combining it with length Function

          len
Duration
30          1
45          6
60         24
450         1


In [68]:
type(df.groupby(["Duration"]).Duration.agg([len])) # This returns a Pandas Data Frame

pandas.core.frame.DataFrame
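
*The two counts only differ when a group contains empty cells. The sketch below uses a hypothetical four-row frame with one missing Calories value and compares count() with size(), the groupby method that, like len, counts every row in a group.*

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one of the 60-minute sessions has no Calories value
workouts = pd.DataFrame({
    "Duration": [60, 60, 45, 45],
    "Calories": [409.1, np.nan, 282.4, 406.0],
})

grouped = workouts.groupby("Duration")["Calories"]

non_null = grouped.count()                   # NaN is not counted
all_rows = grouped.size()                    # every row is counted
as_frame = grouped.agg(["count", "size"])    # a list of functions -> DataFrame

print(as_frame)
```

*For the 60-minute group, count() reports 1 and size() reports 2, while both report 2 for the 45-minute group.*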

In [69]:
df.groupby("Duration").Pulse.min() # Finding the minimum Pulse within each Duration group

Duration
30     109
45      90
60      92
450    104
Name: Pulse, dtype: int64