# pandas Data Structures:
We have learned about **Series**; now let's learn about DataFrames (the 2<sup>nd</sup> workhorse of pandas), which build on the concepts of Series.

* DataFrame
* Grab data (column wise)
* Grab data (row wise)
* Grabbing an element or a sub-set of the dataframe
* Adding new column
* Deleting the column
* boolean_mask
* boolean_mask(Combine 2 conditions)
* reset_index(), set_index(), head(), tail(), info(), describe()

## DataFrame
* A very simple way to think about a DataFrame is as a bunch of Series put together so that they share the same index. <br> 
* A DataFrame is a rectangular table of data that contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). A DataFrame has both a row and a column index; it can be thought of as a dictionary of Series all sharing the same index. <br>
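
The "dictionary of Series" idea can be sketched in a few lines. This is only an illustration: the names `labels`, `ages`, `names` and `people`, and the sample values, are made up here and are not used elsewhere in this notebook.

```python
import pandas as pd

# Two Series that share the same index labels (hypothetical sample data)
labels = ['r1', 'r2', 'r3']
ages = pd.Series([25, 32, 47], index=labels)
names = pd.Series(['Ann', 'Bob', 'Cy'], index=labels)

# A dict of Series becomes a DataFrame:
# the keys become column names, the shared index becomes the row labels
people = pd.DataFrame({'age': ages, 'name': names})

print(people)
print(people.dtypes)                             # each column keeps its own value type
print(people['age'].index.equals(people.index))  # the column shares the frame's index
```

Each column can hold a different value type, while all of them are aligned on the same row labels.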

&#9758; *A good read for those who are interested! ([Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do))<br>*

Let's learn **DataFrame with examples:**<br> 

In [1]:
import pandas as pd
import numpy as np

Let's create two labels/indexes:
* for rows 'r1 to r10'
* for columns 'c1 to c10'

Let's start with a simple example, using **`arange()`** and **`reshape()`** together to create a 2D array (matrix).<br>

In [2]:
index = 'r1 r2 r3 r4 r5 r6 r7 r8 r9 r10'.split()  # ['r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7', 'r8', 'r9', 'r10']
columns = 'c1 c2 c3 c4 c5 c6 c7 c8 c9 c10'.split()

&#9989; *Use **TAB** for auto-complete and **shift + TAB**  for doc.*

In [3]:
# What the index, columns and array_2d look like
index

['r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7', 'r8', 'r9', 'r10']

In [4]:
columns

['c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9', 'c10']

In [5]:
array_2d = np.arange(0,100).reshape(10,10)

In [6]:
array_2d

array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
       [40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
       [50, 51, 52, 53, 54, 55, 56, 57, 58, 59],
       [60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
       [70, 71, 72, 73, 74, 75, 76, 77, 78, 79],
       [80, 81, 82, 83, 84, 85, 86, 87, 88, 89],
       [90, 91, 92, 93, 94, 95, 96, 97, 98, 99]])

In [37]:
# Let's create our first DataFrame using index, columns and array_2d now
df = pd.DataFrame(data = array_2d, index = index, columns = columns)
#df = pd.DataFrame(data = array_2d)

In [38]:
# What the DataFrame looks like
df  # select * from df 

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


**df** is our first dataframe. <br>
We have columns, c1 to c10, and their corresponding rows, r1 to r10. <br>
Each column is actually a pandas Series, sharing a common index (row labels). <br>

&#9758; Let's learn how to **Grab data** that we need; this is the most important thing we need to learn before moving on!<br>

### Columns 

In [24]:
# Grabbing a single column 
df['c10']  # select c10 from df
#df.c10
# The output looks like a Series, right?
# The returned Series also has the same index as the DataFrame

r1      9
r2     19
r3     29
r4     39
r5     49
r6     59
r7     69
r8     79
r9     89
r10    99
Name: c10, dtype: int32

In [13]:
type(df['c10']) # It is a pandas Series 

pandas.core.series.Series

In [20]:
# Grabbing more than one column: pass a list of the columns you need!
# The result is a DataFrame, not a Series.
type(df[['c1', 'c10']])  # df[['c1', 'c10']] is like: select c1, c10 from df

pandas.core.frame.DataFrame

**df.column_name (e.g. df.c1, df.c2 etc.)** can be used to grab a column as well. It's good to know, but I don't recommend it. <br> 
If you press "TAB" after `df.`, you will see lots of available methods; avoiding df.column_name keeps your columns from being confused with these options.<br>
**Let's try this once**

In [None]:
df.c5 #df['c5']
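
Here is a small sketch of why bracket access is the safer habit. The frame `sales` and its column names are hypothetical, chosen on purpose so that one name clashes with a DataFrame method and the other contains a space.

```python
import pandas as pd

# Hypothetical frame whose column names are awkward for attribute access
sales = pd.DataFrame({'count': [3, 5], 'unit price': [1.5, 2.0]})

print(sales['count'])         # bracket access returns the column
print(callable(sales.count))  # sales.count is the DataFrame.count() method, not the column
print(sales['unit price'])    # a name with a space is only reachable with brackets
```

With brackets, the column is always what you get back, whatever it is called.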

### Adding new column
Let's try with the "+" operation!

In [26]:
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [27]:
df['new']= df['c1'] + df['c2']  # select *, (c1 + c2) as new from df
#df.to_csv('abc.csv')

In [28]:
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,new
r1,0,1,2,3,4,5,6,7,8,9,1
r2,10,11,12,13,14,15,16,17,18,19,21
r3,20,21,22,23,24,25,26,27,28,29,41
r4,30,31,32,33,34,35,36,37,38,39,61
r5,40,41,42,43,44,45,46,47,48,49,81
r6,50,51,52,53,54,55,56,57,58,59,101
r7,60,61,62,63,64,65,66,67,68,69,121
r8,70,71,72,73,74,75,76,77,78,79,141
r9,80,81,82,83,84,85,86,87,88,89,161
r10,90,91,92,93,94,95,96,97,98,99,181
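
As a side note, here is a minimal sketch of a non-mutating alternative, `assign()`, which returns a new DataFrame and leaves the original untouched. The small frame `small` and the name `with_sum` are assumptions made for this illustration only.

```python
import numpy as np
import pandas as pd

# A small hypothetical frame: 3 rows, 2 columns
small = pd.DataFrame(np.arange(6).reshape(3, 2),
                     index=['r1', 'r2', 'r3'], columns=['c1', 'c2'])

# assign() returns a new DataFrame with the extra column;
# `small` itself is left unchanged
with_sum = small.assign(new=small['c1'] + small['c2'])

print(with_sum)
print('new' in small.columns)
```

Which style to use is a matter of taste: `df['new'] = ...` changes the frame in place, while `assign()` is handy when the original should stay as it is.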


### Deleting the column -- `drop()`

        *df.drop('new')-- ValueError: labels ['new'] not contained in axis

Press Shift+TAB and you will see that the default axis is 0, which refers to the index (row labels); for a column, we need to specify axis = 1.<br>
&#9758; rows refer to axis 0 and columns refer to axis 1<br> 
&#9758; Quick Check: *df.shape gives the tuple (rows, cols), with rows at [0] and cols at [1]*
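
The axis convention can be checked on a tiny frame. This is a sketch with a made-up frame `small`; the `columns=` keyword shown at the end is simply another way of saying `axis=1`.

```python
import numpy as np
import pandas as pd

small = pd.DataFrame(np.arange(6).reshape(2, 3),
                     index=['r1', 'r2'], columns=['c1', 'c2', 'c3'])

print(small.shape)                 # (rows, cols): shape[0] is axis 0, shape[1] is axis 1

no_row = small.drop('r1', axis=0)  # axis 0: look for the label among the rows
no_col = small.drop('c3', axis=1)  # axis 1: look for the label among the columns
same = small.drop(columns='c3')    # keyword form of the line above

print(no_row)
print(no_col)
```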

In [29]:
# We can delete a column using drop()
# df.drop('new')# ValueError: labels ['new'] not contained in axis
df.drop('new', axis=1)

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


&#9758; Is "new" really deleted? <br>
Output df and you will see that "new" is still there!<br>

In [30]:
df  

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,new
r1,0,1,2,3,4,5,6,7,8,9,1
r2,10,11,12,13,14,15,16,17,18,19,21
r3,20,21,22,23,24,25,26,27,28,29,41
r4,30,31,32,33,34,35,36,37,38,39,61
r5,40,41,42,43,44,45,46,47,48,49,81
r6,50,51,52,53,54,55,56,57,58,59,101
r7,60,61,62,63,64,65,66,67,68,69,121
r8,70,71,72,73,74,75,76,77,78,79,141
r9,80,81,82,83,84,85,86,87,88,89,161
r10,90,91,92,93,94,95,96,97,98,99,181


To delete the column permanently, you have to tell pandas by setting<br>
* ***inplace = True*** (default is inplace=False).<br>

&#9989; *pandas is cautious: it does not want us to lose information by mistake, so it requires inplace for a permanent change*

In [31]:
df.drop('new',axis = 1, inplace = True)
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99
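
An alternative to `inplace = True` is to rebind the name to the frame that `drop()` returns. The sketch below uses a made-up frame `small`, so it does not touch our `df`.

```python
import numpy as np
import pandas as pd

small = pd.DataFrame(np.arange(6).reshape(2, 3), columns=['c1', 'c2', 'new'])

trimmed = small.drop('new', axis=1)  # returns a new frame; small still has 'new'
print(list(small.columns))
print(list(trimmed.columns))

small = small.drop('new', axis=1)    # rebinding the name makes the change stick
print(list(small.columns))
```

Both styles end with the same columns; reassignment just makes it explicit which object holds the result.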


### Rows
We can retrieve a row by its name or position with **[`loc`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html)** and **[`iloc`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html)**.<br>
**loc** -- Access a group of rows and columns by label(s)

In [33]:
# df['r1'] # KeyError: 'r1'
df.loc[['r1']] # loc stands for location; note the square brackets
# A list of labels returns a DataFrame; a single label, df.loc['r1'], returns the row as a Series

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9


Using the row's integer position with **iloc**, even if our index is labeled.

In [39]:
df.iloc[[0]] # iloc[position], integer position based location

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9


In [40]:
# more than one row -- pass a list of rows!
df.loc[['r1','r2', 'r3']]  

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
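
To contrast `loc` and `iloc` a bit further, here is a short sketch on a made-up frame `small`. It shows two details: a single label gives a Series while a list of labels gives a DataFrame, and a label slice includes its end point while a position slice does not.

```python
import numpy as np
import pandas as pd

small = pd.DataFrame(np.arange(12).reshape(4, 3),
                     index=['r1', 'r2', 'r3', 'r4'], columns=['c1', 'c2', 'c3'])

row_series = small.loc['r1']     # single label -> Series
row_frame = small.loc[['r1']]    # list of labels -> DataFrame with one row
print(type(row_series).__name__, type(row_frame).__name__)

by_label = small.loc['r1':'r3']  # label slice includes the end label: r1, r2, r3
by_pos = small.iloc[0:3]         # position slice excludes the end: positions 0, 1, 2
print(by_label.equals(by_pos))
```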


### Grabbing an element or a sub-set of the dataframe

In [41]:
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [42]:
df.loc[['r2', 'r5']]

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r2,10,11,12,13,14,15,16,17,18,19
r5,40,41,42,43,44,45,46,47,48,49


In [43]:
# df.loc[req_row, req_col] -- pass row, col for the element!
df.loc['r1','c1']

0

In [44]:
# for a sub-set, pass the list
df.loc[['r1','r2'],['c1','c2']]

Unnamed: 0,c1,c2
r1,0,1
r2,10,11


In [46]:
# another example - random columns and rows in the list 
df.loc[['r2','r5'],['c3','c4']]

Unnamed: 0,c3,c4
r2,12,13
r5,42,43


In [45]:
df  

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [None]:
# We can do a conditional selection as well
df > 5
# df != 0 
# df == 0

In [49]:
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


This is similar to a NumPy boolean mask. Let's try this:

    *bool_mask = df % 3 == 0
    *df[bool_mask]
It returns the values where the mask is True and NaN where it is False.

In [51]:
# Return the values divisible by 3 
bool_mask = df % 3 == 0
bool_mask
df[bool_mask]
# The same in one step:
# df[df % 3 == 0]

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0.0,,,3.0,,,6.0,,,9.0
r2,,,12.0,,,15.0,,,18.0,
r3,,21.0,,,24.0,,,27.0,,
r4,30.0,,,33.0,,,36.0,,,39.0
r5,,,42.0,,,45.0,,,48.0,
r6,,51.0,,,54.0,,,57.0,,
r7,60.0,,,63.0,,,66.0,,,69.0
r8,,,72.0,,,75.0,,,78.0,
r9,,81.0,,,84.0,,,87.0,,
r10,90.0,,,93.0,,,96.0,,,99.0
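
One thing worth noticing in the output above: the integers now show up as floats (30.0 instead of 30). A small sketch on a made-up frame `small` suggests why: NaN is a floating point value, so columns that receive NaN are stored as floats.

```python
import numpy as np
import pandas as pd

small = pd.DataFrame(np.arange(8).reshape(2, 4),
                     index=['r1', 'r2'], columns=['c1', 'c2', 'c3', 'c4'])

masked = small[small % 3 == 0]    # keep values divisible by 3, NaN elsewhere

print(masked)
print(masked.dtypes)              # every column here contains a NaN, so all are float
print(masked.isna().sum().sum())  # total number of NaN cells
```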


&#9758; It's not common to use such an operation on an entire dataframe. We usually use it on columns or rows instead.<br>
**For example, we don't want a row with NaN values.**<br>
What to do?<br>
Let's have a look at one example.

In [None]:
# Our original dataframe is 
df  # select * from df

Let's apply a condition on column c1, say `c1 > 11`.<br>
Based on the conditional selection, the output will be:

In [48]:
df['c1']>11  #[df['c1'] > 11]
#df[df['c1']>11]

r1     False
r2     False
r3      True
r4      True
r5      True
r6      True
r7      True
r8      True
r9      True
r10     True
Name: c1, dtype: bool

We don't want `r1` and `r2`, as the condition is False for them. <br>
Let's filter the rows based on the condition on the column values.

In [47]:
df[df['c1']>11]  # select * from df where c1 > 11
# We will use such operations frequently in our course.

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


&#9758; The above, **"`df[df['c1']>11]`"**, is a dataframe with the condition applied; we can select any column from this dataframe.<br> For example:

In [52]:
result = df[df['c1']>11]
result

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


We can do the above operations (filtering and selecting columns) in a single line by chaining the commands.


In [58]:
df[df['c1']>11][['c1', 'c10']]
# This could be a little confusing for beginners, but don't worry, we will 
# use such operations frequently in the course as well, and you will find 
# them very handy.

Unnamed: 0,c1,c10
r3,20,29
r4,30,39
r5,40,49
r6,50,59
r7,60,69
r8,70,79
r9,80,89
r10,90,99
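
The chained selection above can also be written as a single `loc` call that takes the row condition and the column list together. The sketch below rebuilds the same 10 x 10 frame so that it runs on its own; the names `chained` and `single` are just for this comparison.

```python
import numpy as np
import pandas as pd

# Same layout as our df: rows r1..r10, columns c1..c10, values 0..99
df = pd.DataFrame(np.arange(100).reshape(10, 10),
                  index=[f'r{i}' for i in range(1, 11)],
                  columns=[f'c{i}' for i in range(1, 11)])

chained = df[df['c1'] > 11][['c1', 'c10']]    # filter rows, then pick columns
single = df.loc[df['c1'] > 11, ['c1', 'c10']] # rows and columns in one step

print(chained.equals(single))
```

For reading data both forms give the same result. When assigning values, the pandas documentation discourages chained indexing, so the single `loc` form is the one to prefer there.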


In [None]:
# let's grab two columns, we need to pass the list ['c1','c9'] here
df[df['c1']>11][['c1','c9']]  #Select c1, c9 from df where c1 > 11

In [59]:
# We can do this operation on rows using loc 
# Passing multiple rows in a list
df[df['c1']>11].loc[['r3','r5']]

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r3,20,21,22,23,24,25,26,27,28,29
r5,40,41,42,43,44,45,46,47,48,49


In [None]:
result = df['c1']==70 
result

In [None]:
df[result]

In [None]:
df[df['c1']==70]  #select * from df where c1 == 70

### Combine 2 conditions 
Let's try on c1 for a value > 60 and on c2 for a value > 80

In [60]:
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [63]:
df[(df['c1']>60) & (df['c2']>80)]  # select * from df where c1>60 and c2>80
# notice that each condition, (df['c1']>60) and (df['c2']>80), is in () for clear separation,
# and the combined condition is wrapped in df[]

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


&#9989;**NOTE:**<br>
The "and" operator will not work in the above condition; using "and" will raise <br>

        *ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

"Ambiguous" here means that "and" only works on single booleans at a time, such as "True and False". We need to use "&" instead ("|" for or).<br>
Try the above code using "and". <br>
The "and" operator gets confused by a Series of True/False values and raises an error.
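
Here is a small runnable sketch of the note above. The frame `frame` is a made-up four-row slice of our data; `~` (element-wise "not") is added as an extra that is not covered above.

```python
import pandas as pd

frame = pd.DataFrame({'c1': [60, 70, 80, 90], 'c2': [61, 71, 81, 91]},
                     index=['r7', 'r8', 'r9', 'r10'])

message = ''
try:
    # "and" asks for one True/False answer from a whole Series, which fails
    frame[(frame['c1'] > 60) and (frame['c2'] > 80)]
except ValueError as err:
    message = str(err)
    print('ValueError:', message)

both = frame[(frame['c1'] > 60) & (frame['c2'] > 80)]    # element-wise and
either = frame[(frame['c1'] > 80) | (frame['c2'] < 70)]  # element-wise or
negated = frame[~(frame['c1'] > 60)]                     # element-wise not

print(both)
print(either)
print(negated)
```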

### Let's have a quick look at a couple of useful methods.
***We will explore more later on in the course!***

**`reset_index()`** and **`set_index()`**<br>
We can reset the index of our dataframe to the numerical index (which is the default index), using `inplace = True` to make the change permanent. *The existing index will become a new column.*

In [64]:
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [65]:
df.reset_index(inplace = True)

In [66]:
df

Unnamed: 0,index,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
0,r1,0,1,2,3,4,5,6,7,8,9
1,r2,10,11,12,13,14,15,16,17,18,19
2,r3,20,21,22,23,24,25,26,27,28,29
3,r4,30,31,32,33,34,35,36,37,38,39
4,r5,40,41,42,43,44,45,46,47,48,49
5,r6,50,51,52,53,54,55,56,57,58,59
6,r7,60,61,62,63,64,65,66,67,68,69
7,r8,70,71,72,73,74,75,76,77,78,79
8,r9,80,81,82,83,84,85,86,87,88,89
9,r10,90,91,92,93,94,95,96,97,98,99
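
If the old labels are not needed, `reset_index()` can discard them with `drop=True` instead of turning them into a column. A minimal sketch on a made-up frame `small`:

```python
import pandas as pd

small = pd.DataFrame({'c1': [0, 10]}, index=['r1', 'r2'])

kept = small.reset_index()              # old labels move into a column named 'index'
dropped = small.reset_index(drop=True)  # old labels are discarded

print(kept)
print(dropped)
print(list(small.index))                # without inplace=True, small is unchanged
```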


In [70]:
df.set_index('c2', inplace = True)
df

c2,c3,c4,c5,c6,c7,c8,c9,c10
1,2,3,4,5,6,7,8,9
11,12,13,14,15,16,17,18,19
21,22,23,24,25,26,27,28,29
31,32,33,34,35,36,37,38,39
41,42,43,44,45,46,47,48,49
51,52,53,54,55,56,57,58,59
61,62,63,64,65,66,67,68,69
71,72,73,74,75,76,77,78,79
81,82,83,84,85,86,87,88,89
91,92,93,94,95,96,97,98,99


In [71]:
df

c2,c3,c4,c5,c6,c7,c8,c9,c10
1,2,3,4,5,6,7,8,9
11,12,13,14,15,16,17,18,19
21,22,23,24,25,26,27,28,29
31,32,33,34,35,36,37,38,39
41,42,43,44,45,46,47,48,49
51,52,53,54,55,56,57,58,59
61,62,63,64,65,66,67,68,69
71,72,73,74,75,76,77,78,79
81,82,83,84,85,86,87,88,89
91,92,93,94,95,96,97,98,99


**Consider: we have a column in our data that could be a useful index,<br>
and we want to set that column as the index!**<br>

In [72]:
array_2d

array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
       [40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
       [50, 51, 52, 53, 54, 55, 56, 57, 58, 59],
       [60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
       [70, 71, 72, 73, 74, 75, 76, 77, 78, 79],
       [80, 81, 82, 83, 84, 85, 86, 87, 88, 89],
       [90, 91, 92, 93, 94, 95, 96, 97, 98, 99]])

In [73]:
columns

['c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9', 'c10']

In [74]:
df = pd.DataFrame(data = array_2d, index = index, columns = columns)
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [None]:
abc = 'a b c d e f g h i j'.split() # split at white spaces
# let's put newind as a column in the df
#df2 = df
df['newind'] = abc
df
#df = pd.DataFrame(data=array_2d, index=index, columns=columns)

In [None]:
# setting newind as the index; inplace = True makes the change permanent
df.set_index('newind', inplace = True)

In [None]:
df
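
The same steps can be sketched without `inplace`, on a made-up three-row frame `small`. Note that `set_index()` replaces the old row labels (r1, r2, r3), and `reset_index()` afterwards brings `newind` back as a regular column.

```python
import pandas as pd

small = pd.DataFrame({'c1': [0, 10, 20], 'newind': ['a', 'b', 'c']},
                     index=['r1', 'r2', 'r3'])

indexed = small.set_index('newind')  # 'newind' becomes the index; r1..r3 are replaced
print(indexed)
print(indexed.loc['b', 'c1'])        # rows are now looked up by the new labels

restored = indexed.reset_index()     # 'newind' goes back to being a column
print(restored)
```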

### `head()`, `tail()`

In [79]:
# Returns first n rows
df.head(5) # n = 5 by default 

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49


In [82]:
# Returns last n rows
df.tail(2) # n = 5 by default

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99
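
A quick sketch of how `n` behaves, using a made-up one-column frame `frame`. The negative value is an extra: `head(-2)` returns everything except the last 2 rows.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'c1': np.arange(10)})

print(len(frame.head()))    # no argument: the default n = 5 is used
print(len(frame.tail(2)))   # last 2 rows
print(len(frame.head(-2)))  # all rows except the last 2
```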


### `info()`
Provides a concise summary of the DataFrame.

In [83]:
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [84]:

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, r1 to r10
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   c1      10 non-null     int32
 1   c2      10 non-null     int32
 2   c3      10 non-null     int32
 3   c4      10 non-null     int32
 4   c5      10 non-null     int32
 5   c6      10 non-null     int32
 6   c7      10 non-null     int32
 7   c8      10 non-null     int32
 8   c9      10 non-null     int32
 9   c10     10 non-null     int32
dtypes: int32(10)
memory usage: 480.0+ bytes
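
`info()` prints its summary. When the same facts are needed as values, `dtypes` and `count()` give them back as Series. A minimal sketch on a made-up frame `frame` with one missing value:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'c1': [1, 2, 3],
                      'c2': [1.5, np.nan, 3.5],
                      'c3': ['x', 'y', 'z']})

print(frame.dtypes)        # the Dtype column of info()
non_null = frame.count()   # the Non-Null Count column of info()
print(non_null)
```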


### `describe()`
Generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset's distribution, excluding `NaN` values.

In [85]:
df

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [86]:
df.describe()

Unnamed: 0,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
count,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0
mean,45.0,46.0,47.0,48.0,49.0,50.0,51.0,52.0,53.0,54.0
std,30.276504,30.276504,30.276504,30.276504,30.276504,30.276504,30.276504,30.276504,30.276504,30.276504
min,0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0
25%,22.5,23.5,24.5,25.5,26.5,27.5,28.5,29.5,30.5,31.5
50%,45.0,46.0,47.0,48.0,49.0,50.0,51.0,52.0,53.0,54.0
75%,67.5,68.5,69.5,70.5,71.5,72.5,73.5,74.5,75.5,76.5
max,90.0,91.0,92.0,93.0,94.0,95.0,96.0,97.0,98.0,99.0
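
The numbers in the c1 column above can be reproduced by hand. The sketch below rebuilds c1 as a standalone Series and compares `describe()` with the individual methods; `std` is the sample standard deviation (`ddof=1`).

```python
import numpy as np
import pandas as pd

c1 = pd.Series(np.arange(0, 100, 10), name='c1')  # 0, 10, ..., 90 as in column c1

summary = c1.describe()
print(summary)

print(c1.mean())          # matches summary['mean']
print(c1.std(ddof=1))     # matches summary['std']
print(c1.quantile(0.25))  # matches summary['25%']
```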


# Excellent! 
I want to congratulate you here: you are making great progress, keep it up!

In [90]:
df3 = pd.read_csv(r'E:\Breast_Cancer_Diagnostic.csv')  # raw string, so the backslash is not treated as an escape

In [93]:
df3

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
0,842302,M,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,...,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890,
1,842517,M,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,...,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902,
2,84300903,M,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,...,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758,
3,84348301,M,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,...,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300,
4,84358402,M,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,...,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,926424,M,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,...,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115,
565,926682,M,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,...,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637,
566,926954,M,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,...,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820,
567,927241,M,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,...,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400,


In [89]:
df3.describe()

Unnamed: 0,id,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,Unnamed: 32
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,0.0
mean,30371830.0,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,...,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946,
std,125020600.0,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,...,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061,
min,8670.0,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,...,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504,
25%,869218.0,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,...,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146,
50%,906024.0,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,...,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004,
75%,8813129.0,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,...,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208,
max,911320500.0,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,...,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075,
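
Notice the last column, `Unnamed: 32`, which has a count of 0. One plausible cause (an assumption, since we have not looked inside the file) is a trailing comma at the end of each line. The sketch below imitates that with a tiny made-up CSV held in memory and drops columns that are entirely NaN.

```python
import io
import pandas as pd

# Hypothetical CSV text; every line ends with a trailing comma
csv_text = (
    "id,diagnosis,radius_mean,\n"
    "1,M,17.99,\n"
    "2,B,13.54,\n"
)

raw = pd.read_csv(io.StringIO(csv_text))
print(list(raw.columns))                   # the empty last header gets an 'Unnamed: N' name

cleaned = raw.dropna(axis=1, how='all')    # drop columns where every value is NaN
print(list(cleaned.columns))
```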
