### Tips and Tricks for Jupyter Notebook

In [1]:
# Reminder

# Shift + Enter --> Run cell + New cell
# Ctrl + Enter --> Run cell
# Kernel > "Restart & Run All" --> Restart the kernel and run all cells in notebook order
# Kernel > "Restart & Clear Output" --> Restart the kernel and clear the output of all cells

In [2]:
# Show the result of every expression in a cell, not only the last one. This lets us condense each trick into a single cell.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [3]:
help(len) # Documentation of a function

Help on built-in function len in module builtins:

len(obj, /)
    Return the number of items in a container.



In [4]:
# the ? syntax also works for any generic object
x = 5
x?

### Tab Completion 
If you’re using IPython, tab completion for column names (as well as public attributes) is automatically enabled. <br/>
Here’s a subset of the attributes that will be completed:

In [5]:
# lst.<TAB>

# try it ...
lst = [1, 2, 3, 4, 5]
# lst.

In [6]:
lst = ['Adi', 23, 'Dan', 1]
# Try it: place the cursor after lst and press Shift + Tab to see a description of the variable
lst

['Adi', 23, 'Dan', 1]

### Import Pandas Library

In [7]:
import pandas as pd

### Series

Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, python objects, etc.). The axis labels are collectively called index.

A pandas Series is comparable to a single column in an Excel sheet.

### Creating a Series

In [8]:
s = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
print(s)
s['b']

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64


0.5

In [9]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
print(population,'\n')
print(population['Florida'])
population['Texas':'Illinois'] # Note that label-based slicing is inclusive (from 'Texas' through 'Illinois')

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64 

19552860


Texas       26448193
New York    19651127
Florida     19552860
Illinois    12882135
dtype: int64
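The inclusive behaviour above applies to label-based slices only. A minimal sketch contrasting it with position-based slicing through `.iloc`, reusing the same population figures (the variable names `by_label` and `by_position` are chosen here for illustration):

```python
import pandas as pd

population = pd.Series({'California': 38332521,
                        'Texas': 26448193,
                        'New York': 19651127,
                        'Florida': 19552860,
                        'Illinois': 12882135})

# Label-based slice: both endpoints are included
by_label = population['Texas':'Florida']

# Position-based slice: the stop position is excluded, as with Python lists
by_position = population.iloc[1:3]

print(by_label.index.tolist())
print(by_position.index.tolist())
```

The label slice returns three rows (Texas, New York, Florida), while the positional slice returns two (Texas, New York).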

### DataFrame

A DataFrame is a two-dimensional data structure with labeled axes (rows and columns).

### Pandas and dataframe
Pandas is widely used in data analysis and machine learning workflows, where tabular data is typically held in DataFrames. <br/>
Pandas can import data from various file formats such as CSV and Excel.<br/> 
Pandas supports data manipulation operations such as groupby, join, merge, melt, and concatenation, as well as data cleaning features such as filling, replacing, or imputing null values.
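A rough sketch of a few of the operations listed above (`fillna`, `groupby`, `merge`, `pd.concat`). The `sales` and `cities` frames and their values are invented for this illustration and are not part of the notebook's data:

```python
import pandas as pd

# Hypothetical data, invented for this sketch
sales = pd.DataFrame({'store': ['A', 'A', 'B', 'B'],
                      'units': [10, None, 7, 5]})
cities = pd.DataFrame({'store': ['A', 'B'],
                       'city': ['Haifa', 'Eilat']})

# Data cleaning: replace the missing value with 0
sales['units'] = sales['units'].fillna(0)

# groupby: total units per store
totals = sales.groupby('store', as_index=False)['units'].sum()

# merge: attach the city of each store (SQL-style join on 'store')
report = totals.merge(cities, on='store')

# concat: stack two frames with the same columns on top of each other
stacked = pd.concat([sales, sales], ignore_index=True)

print(report)
```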

### Loading CSV file

In [10]:
df = pd.read_csv("saman.csv") # index_col='sample_num' - if we pass this parameter, sample_num becomes the index column
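To see what `index_col` changes without needing the file on disk, here is a small sketch that reads the same kind of data from an in-memory string. The two-row CSV text is a stand-in written for this example, not the contents of `saman.csv`:

```python
import io
import pandas as pd

# A small in-memory stand-in for a CSV file (contents assumed for this sketch)
csv_text = """sample_num,First_name,Age
1,Jason,42
2,Molly,52
"""

# Default: pandas generates a RangeIndex (0, 1, ...)
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col, the named column becomes the row index
custom_index = pd.read_csv(io.StringIO(csv_text), index_col='sample_num')

print(default_index.index.tolist())
print(custom_index.index.tolist())
```

With `index_col='sample_num'`, that column no longer appears among the regular columns; its values label the rows instead.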

### Data Exploration

In [11]:
df

Unnamed: 0,sample_num,First_name,Last_name,Age,Weight_before,Weight_after
0,1,Jason,Miller,42,108,85
1,2,Molly,Jacobson,52,99,84
2,3,Tina,,36,134,67
3,4,Jake,Milner,24,112,62
4,5,Amy,Cooze,73,95,74
5,6,Joshua,Dylan,25,71,72
6,7,Daniel,Wilson,53,94,92
7,8,Nancy,King,37,85,73
8,9,Ross,Evans,63,79,76
9,10,Miles,Bark,39,102,96


In [12]:
df.shape # DataSet dimensions

(10, 6)

In [13]:
df.columns # Return an Index holding all column labels

Index(['sample_num', 'First_name', 'Last_name', 'Age', 'Weight_before',
       'Weight_after'],
      dtype='object')

In [14]:
df.dtypes # Data type of each column (object usually means string)

sample_num        int64
First_name       object
Last_name        object
Age               int64
Weight_before     int64
Weight_after      int64
dtype: object

In [15]:
df.head() # First 5 rows (Default)

Unnamed: 0,sample_num,First_name,Last_name,Age,Weight_before,Weight_after
0,1,Jason,Miller,42,108,85
1,2,Molly,Jacobson,52,99,84
2,3,Tina,,36,134,67
3,4,Jake,Milner,24,112,62
4,5,Amy,Cooze,73,95,74


In [16]:
df.head(7)

Unnamed: 0,sample_num,First_name,Last_name,Age,Weight_before,Weight_after
0,1,Jason,Miller,42,108,85
1,2,Molly,Jacobson,52,99,84
2,3,Tina,,36,134,67
3,4,Jake,Milner,24,112,62
4,5,Amy,Cooze,73,95,74
5,6,Joshua,Dylan,25,71,72
6,7,Daniel,Wilson,53,94,92


In [17]:
df.tail() # Last 5 rows

Unnamed: 0,sample_num,First_name,Last_name,Age,Weight_before,Weight_after
5,6,Joshua,Dylan,25,71,72
6,7,Daniel,Wilson,53,94,92
7,8,Nancy,King,37,85,73
8,9,Ross,Evans,63,79,76
9,10,Miles,Bark,39,102,96


In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   sample_num     10 non-null     int64 
 1   First_name     10 non-null     object
 2   Last_name      9 non-null      object
 3   Age            10 non-null     int64 
 4   Weight_before  10 non-null     int64 
 5   Weight_after   10 non-null     int64 
dtypes: int64(4), object(2)
memory usage: 608.0+ bytes
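The `Non-Null Count` column above shows that one `Last_name` value is missing. A minimal sketch of counting and locating missing values directly with `isna()`, using a three-row stand-in frame rather than the full dataset:

```python
import pandas as pd

# Small stand-in for the notebook's data: one last name is missing
people = pd.DataFrame({'First_name': ['Jason', 'Molly', 'Tina'],
                       'Last_name': ['Miller', 'Jacobson', None]})

# Number of missing values in each column
missing_per_column = people.isna().sum()

# Rows where Last_name is missing
rows_with_missing = people[people['Last_name'].isna()]

print(missing_per_column)
print(rows_with_missing)
```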


In [19]:
df.describe() # Summary statistics for the numerical columns only

Unnamed: 0,sample_num,Age,Weight_before,Weight_after
count,10.0,10.0,10.0,10.0
mean,5.5,44.4,97.9,78.1
std,3.02765,15.805765,17.928562,10.867382
min,1.0,24.0,71.0,62.0
25%,3.25,36.25,87.25,72.25
50%,5.5,40.5,97.0,75.0
75%,7.75,52.75,106.5,84.75
max,10.0,73.0,134.0,96.0


In [20]:
df.describe(include="all") # Include object (string) columns as well

Unnamed: 0,sample_num,First_name,Last_name,Age,Weight_before,Weight_after
count,10.0,10,9,10.0,10.0,10.0
unique,,10,9,,,
top,,Ross,Wilson,,,
freq,,1,1,,,
mean,5.5,,,44.4,97.9,78.1
std,3.02765,,,15.805765,17.928562,10.867382
min,1.0,,,24.0,71.0,62.0
25%,3.25,,,36.25,87.25,72.25
50%,5.5,,,40.5,97.0,75.0
75%,7.75,,,52.75,106.5,84.75
max,10.0,,,73.0,134.0,96.0


In [21]:
df["Age"] # Return a specific column as a Series

0    42
1    52
2    36
3    24
4    73
5    25
6    53
7    37
8    63
9    39
Name: Age, dtype: int64

In [22]:
df.iloc[7, :] # Return a specific row (position 7) as a Series

sample_num           8
First_name       Nancy
Last_name         King
Age                 37
Weight_before       85
Weight_after        73
Name: 7, dtype: object

In [23]:
df.iloc[:,3] # Return a specific column (position 3) as a Series

0    42
1    52
2    36
3    24
4    73
5    25
6    53
7    37
8    63
9    39
Name: Age, dtype: int64

In [24]:
df.iloc[7,1] # Return a specific cell

'Nancy'

In [25]:
df.iloc[2, [1,2]] # Return the row at position 2 (the third row) and the columns at positions 1 and 2

First_name    Tina
Last_name      NaN
Name: 2, dtype: object

In [26]:
df.iloc[0:9:2,] # Return every second row, starting from position 0 (the even positions)

Unnamed: 0,sample_num,First_name,Last_name,Age,Weight_before,Weight_after
0,1,Jason,Miller,42,108,85
2,3,Tina,,36,134,67
4,5,Amy,Cooze,73,95,74
6,7,Daniel,Wilson,53,94,92
8,9,Ross,Evans,63,79,76
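All of the selections above use `.iloc`, which works by integer position. Its label-based counterpart is `.loc`. The sketch below uses a small stand-in frame with a non-default index (the labels 10, 20, 30 are chosen here so that labels and positions differ):

```python
import pandas as pd

# Stand-in frame with a non-default index, so labels and positions differ
frame = pd.DataFrame({'First_name': ['Jason', 'Molly', 'Tina'],
                      'Age': [42, 52, 36]},
                     index=[10, 20, 30])

by_position = frame.iloc[0, 1]    # first row, second column
by_label = frame.loc[10, 'Age']   # row labelled 10, column 'Age'

label_slice = frame.loc[10:20]    # label slice: the row labelled 20 is included
position_slice = frame.iloc[0:1]  # position slice: position 1 is excluded

print(by_position, by_label)
print(len(label_slice), len(position_slice))
```

Both lookups reach the same cell, but the two slices differ in length because only label slices include their endpoint.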


In [27]:
df.iloc[2,2] = "Terner" # Assign a new value to a specific cell

In [28]:
df

Unnamed: 0,sample_num,First_name,Last_name,Age,Weight_before,Weight_after
0,1,Jason,Miller,42,108,85
1,2,Molly,Jacobson,52,99,84
2,3,Tina,Terner,36,134,67
3,4,Jake,Milner,24,112,62
4,5,Amy,Cooze,73,95,74
5,6,Joshua,Dylan,25,71,72
6,7,Daniel,Wilson,53,94,92
7,8,Nancy,King,37,85,73
8,9,Ross,Evans,63,79,76
9,10,Miles,Bark,39,102,96


In [29]:
df[df["Age"] > 42] # Filter rows by a condition

Unnamed: 0,sample_num,First_name,Last_name,Age,Weight_before,Weight_after
1,2,Molly,Jacobson,52,99,84
4,5,Amy,Cooze,73,95,74
6,7,Daniel,Wilson,53,94,92
8,9,Ross,Evans,63,79,76


In [30]:
df[(df["Age"] > 42) & (df['Weight_before'] - df['Weight_after'] > 20)]

Unnamed: 0,sample_num,First_name,Last_name,Age,Weight_before,Weight_after
4,5,Amy,Cooze,73,95,74


In [31]:
df[(df["Age"] > 42) | (df['Weight_before'] - df['Weight_after'] > 20)]

Unnamed: 0,sample_num,First_name,Last_name,Age,Weight_before,Weight_after
0,1,Jason,Miller,42,108,85
1,2,Molly,Jacobson,52,99,84
2,3,Tina,Terner,36,134,67
3,4,Jake,Milner,24,112,62
4,5,Amy,Cooze,73,95,74
6,7,Daniel,Wilson,53,94,92
8,9,Ross,Evans,63,79,76
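Besides `&` and `|`, boolean masks can be negated with `~`, combined with column selection through `.loc`, or built with `isin`. A short sketch on a four-row stand-in frame (the mask names `older` and `lost_over_20` are introduced here for readability):

```python
import pandas as pd

people = pd.DataFrame({'First_name': ['Jason', 'Molly', 'Tina', 'Jake'],
                       'Age': [42, 52, 36, 24],
                       'Weight_before': [108, 99, 134, 112],
                       'Weight_after': [85, 84, 67, 62]})

# Naming the masks keeps long conditions readable
older = people['Age'] > 42
lost_over_20 = (people['Weight_before'] - people['Weight_after']) > 20

# ~ negates a mask: everyone who is NOT older than 42
not_older = people[~older]

# .loc takes a mask for the rows and a list of labels for the columns
names = people.loc[lost_over_20, ['First_name', 'Age']]

# isin tests membership in a list of values
selected = people[people['First_name'].isin(['Molly', 'Jake'])]

print(not_older)
print(names)
print(selected)
```

Each condition needs its own parentheses when combined, because `&`, `|` and `~` bind more tightly than comparison operators in Python.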


In [32]:
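# Append a new row (DataFrame.append was removed in pandas 2.0, so use pd.concat)
new_row = pd.DataFrame([{'sample_num' : 11,
                         'First_name' : 'Adam',
                         'Last_name' : 'Cohen',
                         'Age' : 37,
                         'Weight_before' : 90,
                         'Weight_after' : 85}])
df = pd.concat([df, new_row], ignore_index=True)
df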

Unnamed: 0,sample_num,First_name,Last_name,Age,Weight_before,Weight_after
0,1,Jason,Miller,42,108,85
1,2,Molly,Jacobson,52,99,84
2,3,Tina,Terner,36,134,67
3,4,Jake,Milner,24,112,62
4,5,Amy,Cooze,73,95,74
5,6,Joshua,Dylan,25,71,72
6,7,Daniel,Wilson,53,94,92
7,8,Nancy,King,37,85,73
8,9,Ross,Evans,63,79,76
9,10,Miles,Bark,39,102,96
10,11,Adam,Cohen,37,90,85


In [33]:
# Append new column
df['Gender'] = ['M', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'M', 'M', 'M']
df

Unnamed: 0,sample_num,First_name,Last_name,Age,Weight_before,Weight_after,Gender
0,1,Jason,Miller,42,108,85,M
1,2,Molly,Jacobson,52,99,84,F
2,3,Tina,Terner,36,134,67,F
3,4,Jake,Milner,24,112,62,M
4,5,Amy,Cooze,73,95,74,F
5,6,Joshua,Dylan,25,71,72,M
6,7,Daniel,Wilson,53,94,92,M
7,8,Nancy,King,37,85,73,F
8,9,Ross,Evans,63,79,76,M
9,10,Miles,Bark,39,102,96,M
10,11,Adam,Cohen,37,90,85,M
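A new column can also be computed from existing ones instead of being typed in by hand. A minimal sketch on a three-row stand-in frame; the column names `Weight_lost` and `Lost_weight` are made up for this example:

```python
import pandas as pd

people = pd.DataFrame({'First_name': ['Jason', 'Molly', 'Joshua'],
                       'Weight_before': [108, 99, 71],
                       'Weight_after': [85, 84, 72]})

# Column arithmetic is element-wise, so no loop is needed
people['Weight_lost'] = people['Weight_before'] - people['Weight_after']

# assign returns a new DataFrame and leaves `people` unchanged
with_flag = people.assign(Lost_weight=people['Weight_lost'] > 0)

print(with_flag)
```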


In [34]:
df.sort_values(by='Age') # Sort rows by the values of a column (ascending by default)

Unnamed: 0,sample_num,First_name,Last_name,Age,Weight_before,Weight_after,Gender
3,4,Jake,Milner,24,112,62,M
5,6,Joshua,Dylan,25,71,72,M
2,3,Tina,Terner,36,134,67,F
7,8,Nancy,King,37,85,73,F
10,11,Adam,Cohen,37,90,85,M
9,10,Miles,Bark,39,102,96,M
0,1,Jason,Miller,42,108,85,M
1,2,Molly,Jacobson,52,99,84,F
6,7,Daniel,Wilson,53,94,92,M
8,9,Ross,Evans,63,79,76,M
4,5,Amy,Cooze,73,95,74,F
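`sort_values` also accepts a sort direction and several keys. The sketch below uses a four-row stand-in frame in which two people share the same age, so the second key decides their order:

```python
import pandas as pd

people = pd.DataFrame({'First_name': ['Nancy', 'Adam', 'Jake', 'Amy'],
                       'Age': [37, 37, 24, 73],
                       'Weight_after': [73, 85, 62, 74]})

# Descending order
oldest_first = people.sort_values(by='Age', ascending=False)

# Ties on Age are broken by Weight_after, heaviest first
by_two_keys = people.sort_values(by=['Age', 'Weight_after'],
                                 ascending=[True, False])

# reset_index(drop=True) renumbers the rows 0..n-1 after sorting
renumbered = oldest_first.reset_index(drop=True)

print(by_two_keys)
print(renumbered)
```

Note that `sort_values` returns a new DataFrame; the original keeps its order unless the result is assigned back.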


In [35]:
df['Age'].min()

24

In [36]:
# Copy by value (not by reference)

new_df = df.copy() # copy() creates an independent DataFrame, so changes to one do not affect the other
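The difference between plain assignment and `copy()` can be checked directly. In this sketch, `alias` and `independent` are names chosen for illustration:

```python
import pandas as pd

original = pd.DataFrame({'Age': [42, 52]})

alias = original               # same object under a second name
independent = original.copy()  # separate object with its own data

alias.loc[0, 'Age'] = 99       # also visible through `original`
independent.loc[1, 'Age'] = 0  # does not touch `original`

print(original)
```

After running this, `original` holds 99 in the first row (changed through the alias) and still 52 in the second row (the copy was changed, not the original).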

### Creating DataFrame 

In [37]:
# Creating a dictionary
data = {'Student' : ['Dani', 'Kelly', 'Ronen', 'Max'],
        'Algebra' : [100, 82, 93, 60],
        'Data Science' : [88, 83, 98, 79]}

In [38]:
my_df = pd.DataFrame(data)
my_df

Unnamed: 0,Student,Algebra,Data Science
0,Dani,100,88
1,Kelly,82,83
2,Ronen,93,98
3,Max,60,79


In [39]:
# Creating a sub-DataFrame
my_sub_df_1 = pd.DataFrame(data, columns=['Student', 'Data Science']) # Built from the dictionary we created
my_sub_df_1

Unnamed: 0,Student,Data Science
0,Dani,88
1,Kelly,83
2,Ronen,98
3,Max,79


In [40]:
my_sub_df_2 = my_df.loc[:, ['Student', 'Data Science']] # Built from the DataFrame we created; a list of labels selects several columns
my_sub_df_2

Unnamed: 0,Student,Data Science
0,Dani,88
1,Kelly,83
2,Ronen,98
3,Max,79
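Two other common ways to reach the same sub-DataFrame are double-bracket selection and `drop`. A brief sketch, rebuilding `my_df` from the same dictionary so the block runs on its own:

```python
import pandas as pd

my_df = pd.DataFrame({'Student': ['Dani', 'Kelly', 'Ronen', 'Max'],
                      'Algebra': [100, 82, 93, 60],
                      'Data Science': [88, 83, 98, 79]})

# A list of labels inside [] returns a DataFrame with those columns
two_columns = my_df[['Student', 'Data Science']]

# A single label returns a Series instead
one_column = my_df['Algebra']

# drop removes the named columns and returns a new DataFrame
without_algebra = my_df.drop(columns=['Algebra'])

print(two_columns.equals(without_algebra))
```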
