Working with Jupyter Notebook
==

## Contents
1. [Reading data](#read_data1)
2. [Analysis](#analysis)
3. [Examples of basic functionality](#examples)
4. [Useful links](#additional_info)

___

In [1]:
import pandas as pd
import numpy as np

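# For Linux / macOS: uncomment if the download fails with an SSL
# certificate error (this turns off certificate verification)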
# import ssl
# ssl._create_default_https_context = ssl._create_unverified_context

## Reading data <a name="read_data1"></a>

In [2]:
%%time
df = pd.read_csv('https://code.s3.yandex.net/datasets/data.csv')

CPU times: total: 188 ms
Wall time: 541 ms
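
`pd.read_csv` accepts file-like objects as well as URLs, which makes it possible to try parsing options without a network connection. The sketch below is an illustration only: the CSV text and the names `raw_csv` and `small_df` are made up, and only the two column names are borrowed from the dataset.

```python
import io

import pandas as pd

# Hypothetical CSV text; it stands in for the remote file
raw_csv = io.StringIO(
    "children,total_income\n"
    "1,253875.6\n"
    "0,\n"
    "3,145885.9\n"
)

# usecols limits the columns that are parsed; dtype fixes the types up front
small_df = pd.read_csv(
    raw_csv,
    usecols=["children", "total_income"],
    dtype={"children": "int64", "total_income": "float64"},
)
print(small_df.dtypes)
```

The empty field in the second data row is read as `NaN`, the same kind of gap that shows up in `total_income` in the real data.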


## Analysis <a name="analysis"></a>

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21525 entries, 0 to 21524
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   children          21525 non-null  int64  
 1   days_employed     19351 non-null  float64
 2   dob_years         21525 non-null  int64  
 3   education         21525 non-null  object 
 4   education_id      21525 non-null  int64  
 5   family_status     21525 non-null  object 
 6   family_status_id  21525 non-null  int64  
 7   gender            21525 non-null  object 
 8   income_type       21525 non-null  object 
 9   debt              21525 non-null  int64  
 10  total_income      19351 non-null  float64
 11  purpose           21525 non-null  object 
dtypes: float64(2), int64(5), object(5)
memory usage: 2.0+ MB


Count the **number** of missing values per column

In [4]:
df.isna().sum()

children               0
days_employed       2174
dob_years              0
education              0
education_id           0
family_status          0
family_status_id       0
gender                 0
income_type            0
debt                   0
total_income        2174
purpose                0
dtype: int64

Compute the **share** of missing values per column, in percent, rounded to two decimals

In [5]:
round(df.isna().sum() * 100 / len(df), 2)

children             0.0
days_employed       10.1
dob_years            0.0
education            0.0
education_id         0.0
family_status        0.0
family_status_id     0.0
gender               0.0
income_type          0.0
debt                 0.0
total_income        10.1
purpose              0.0
dtype: float64
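
The same percentages can be obtained from the mean of the boolean mask, because the mean of `True`/`False` values is the share of `True`. A minimal sketch on a made-up frame (`toy_df` and `missing_share` are hypothetical names):

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": ["x", "y", "z", "w"],
})

# Share of missing values per column, in percent
missing_share = (toy_df.isna().mean() * 100).round(2)
print(missing_share)
```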

Summary statistics for the _numeric_ columns

In [6]:
df.describe()

|   | children | days_employed | dob_years | education_id | family_status_id | debt | total_income |
|---|---|---|---|---|---|---|---|
| count | 21525.0 | 19351.0 | 21525.0 | 21525.0 | 21525.0 | 21525.0 | 19351.0 |
| mean | 0.538908 | 63046.497661 | 43.29338 | 0.817236 | 0.972544 | 0.080883 | 167422.3 |
| std | 1.381587 | 140827.311974 | 12.574584 | 0.548138 | 1.420324 | 0.272661 | 102971.6 |
| min | -1.0 | -18388.949901 | 0.0 | 0.0 | 0.0 | 0.0 | 20667.26 |
| 25% | 0.0 | -2747.423625 | 33.0 | 1.0 | 0.0 | 0.0 | 103053.2 |
| 50% | 0.0 | -1203.369529 | 42.0 | 1.0 | 0.0 | 0.0 | 145017.9 |
| 75% | 1.0 | -291.095954 | 53.0 | 1.0 | 1.0 | 0.0 | 203435.1 |
| max | 20.0 | 401755.400475 | 75.0 | 4.0 | 4.0 | 1.0 | 2265604.0 |
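
By default `describe()` summarises only numeric columns, which is why `education`, `gender` and the other text columns are absent from the summary. Excluding numeric dtypes instead gives the count, the number of unique values, the most frequent value and its frequency. A sketch on made-up data (`toy_df`, `text_summary` and the values are hypothetical):

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "gender": ["F", "M", "F", "F"],
    "children": [0, 1, 2, 0],
})

# Summary for non-numeric columns: count, unique, top (most frequent), freq
text_summary = toy_df.describe(exclude=[np.number])
print(text_summary)
```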


____

## Examples of basic functionality <a name="examples"></a>

### Create sample data

In [7]:
sample_data = {'col_1': [np.nan, 3, 2, 1, 0, np.nan, 4], 
               'col_2': ['a', 'b', 'c', 'd', 'e', 'f', np.nan]}

sample_df = pd.DataFrame.from_dict(sample_data)
sample_df

|   | col_1 | col_2 |
|---|---|---|
| 0 | NaN | a |
| 1 | 3.0 | b |
| 2 | 2.0 | c |
| 3 | 1.0 | d |
| 4 | 0.0 | e |
| 5 | NaN | f |
| 6 | 4.0 | NaN |


### Rename several DataFrame columns

In [8]:
sample_df.rename(columns = {
    'col_1':'col1',
    'col_2':'col2'
})

|   | col1 | col2 |
|---|---|---|
| 0 | NaN | a |
| 1 | 3.0 | b |
| 2 | 2.0 | c |
| 3 | 1.0 | d |
| 4 | 0.0 | e |
| 5 | NaN | f |
| 6 | 4.0 | NaN |
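
`rename` returns a new DataFrame, which is why later cells still see `col_1` and `col_2`. To keep the new names, the result has to be assigned to a variable. A short sketch on a cut-down copy of the sample data (`renamed_df` is a hypothetical name):

```python
import numpy as np
import pandas as pd

sample_df = pd.DataFrame({"col_1": [np.nan, 3, 2], "col_2": ["a", "b", "c"]})

# rename() returns a new DataFrame and leaves the original untouched
renamed_df = sample_df.rename(columns={"col_1": "col1", "col_2": "col2"})

print(list(sample_df.columns))   # original names are still in place
print(list(renamed_df.columns))  # new names live in the returned frame
```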


### Fill NA's

In [9]:
# forward fill: each NaN takes the last valid value above it
sample_df.ffill()

|   | col_1 | col_2 |
|---|---|---|
| 0 | NaN | a |
| 1 | 3.0 | b |
| 2 | 2.0 | c |
| 3 | 1.0 | d |
| 4 | 0.0 | e |
| 5 | 0.0 | f |
| 6 | 4.0 | f |


In [10]:
# backward fill: each NaN takes the next valid value below it
sample_df.bfill()

|   | col_1 | col_2 |
|---|---|---|
| 0 | 3.0 | a |
| 1 | 3.0 | b |
| 2 | 2.0 | c |
| 3 | 1.0 | d |
| 4 | 0.0 | e |
| 5 | 4.0 | f |
| 6 | 4.0 | NaN |


In [11]:
# fill column with mean
sample_df['col_1'].fillna(sample_df['col_1'].mean())

0    2.0
1    3.0
2    2.0
3    1.0
4    0.0
5    2.0
6    4.0
Name: col_1, dtype: float64
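
When a column contains outliers, the median is often a safer fill value than the mean. The sketch below fills every numeric column with its own median in one call; `toy_df`, `medians`, `filled_df` and the data are made up for illustration.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, 100.0],
    "y": [np.nan, 2.0, 4.0, 6.0],
    "label": ["a", "b", None, "d"],
})

# median() over numeric columns yields a Series indexed by column name;
# fillna() then uses each value for the matching column only
medians = toy_df.median(numeric_only=True)
filled_df = toy_df.fillna(medians)
print(filled_df)
```

The `label` column has no entry in `medians`, so its missing value is left as it is.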

In [12]:
# clean up missing values in multiple DataFrame columns
sample_df.fillna({
    'col_1': 0.0,
    'col_2': 'missing value'
})

|   | col_1 | col_2 |
|---|---|---|
| 0 | 0.0 | a |
| 1 | 3.0 | b |
| 2 | 2.0 | c |
| 3 | 1.0 | d |
| 4 | 0.0 | e |
| 5 | 0.0 | f |
| 6 | 4.0 | missing value |


### Drop NA's

In [13]:
# drop rows that contain at least one NaN
sample_df.dropna()

|   | col_1 | col_2 |
|---|---|---|
| 1 | 3.0 | b |
| 2 | 2.0 | c |
| 3 | 1.0 | d |
| 4 | 0.0 | e |


In [14]:
# drop columns that contain at least one NaN
# (both columns here, so only the index remains)
sample_df.dropna(axis=1)

0
1
2
3
4
5
6


In [15]:
# drop only the columns in which every value is NaN (none here)
sample_df.dropna(axis='columns', how='all')

|   | col_1 | col_2 |
|---|---|---|
| 0 | NaN | a |
| 1 | 3.0 | b |
| 2 | 2.0 | c |
| 3 | 1.0 | d |
| 4 | 0.0 | e |
| 5 | NaN | f |
| 6 | 4.0 | NaN |


In [16]:
# keep only rows with at least `thresh` non-NaN values
sample_df.dropna(axis='index', thresh=2)

|   | col_1 | col_2 |
|---|---|---|
| 1 | 3.0 | b |
| 2 | 2.0 | c |
| 3 | 1.0 | d |
| 4 | 0.0 | e |


In [17]:
# drop rows with NaN in any of the specified columns
sample_df.dropna(subset=['col_1', 'col_2'])

|   | col_1 | col_2 |
|---|---|---|
| 1 | 3.0 | b |
| 2 | 2.0 | c |
| 3 | 1.0 | d |
| 4 | 0.0 | e |
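
With both columns listed in `subset`, the result is the same as a plain `dropna()`. The effect of `subset` is easier to see when only one column is checked, as in this sketch on the same sample data (`kept_df` is a hypothetical name):

```python
import numpy as np
import pandas as pd

sample_df = pd.DataFrame({
    "col_1": [np.nan, 3, 2, 1, 0, np.nan, 4],
    "col_2": ["a", "b", "c", "d", "e", "f", np.nan],
})

# Only NaNs in col_1 cause a row to be dropped; NaN in col_2 is tolerated
kept_df = sample_df.dropna(subset=["col_1"])
print(kept_df.index.tolist())
```

Row 6 survives here even though its `col_2` value is missing.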


### Column names

In [18]:
# Lower-case all DataFrame column names
sample_df.columns.str.lower()

Index(['col_1', 'col_2'], dtype='object')

In [19]:
# Update all DataFrame column names
sample_df.columns.map(lambda x: x + '!')

Index(['col_1!', 'col_2!'], dtype='object')
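
Both calls above return a new `Index` and do not change the DataFrame. To apply the new names, assign the result back to `columns`. A minimal sketch with a made-up frame (`toy_df` is hypothetical):

```python
import pandas as pd

toy_df = pd.DataFrame({"Col_A": [1, 2], "COL_B": [3, 4]})

# The .str methods return a new Index; assign it back to apply the change
toy_df.columns = toy_df.columns.str.lower()
print(list(toy_df.columns))
```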

### Column data

In [20]:
def find_cat_and_num_cols(df):
    """Return (categorical, numerical) column names.

    A column counts as categorical when its dtype is 'object'.
    """
    categorical_columns = [
        c for c in df.columns if df[c].dtype.name == 'object'
    ]
    numerical_columns = [
        c for c in df.columns if df[c].dtype.name != 'object'
    ]

    return categorical_columns, numerical_columns

find_cat_and_num_cols(sample_df)

(['col_2'], ['col_1'])
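
A similar split can be sketched with the built-in `select_dtypes`, which selects by dtype family rather than by dtype name. Note the assumption in this variant: everything that is not numeric (including dates or booleans, if present) lands in the categorical list.

```python
import numpy as np
import pandas as pd

sample_df = pd.DataFrame({
    "col_1": [np.nan, 3, 2],
    "col_2": ["a", "b", np.nan],
})

# Numeric columns by dtype family; everything else is treated as categorical
numerical_columns = sample_df.select_dtypes(include="number").columns.tolist()
categorical_columns = sample_df.select_dtypes(exclude="number").columns.tolist()
print(categorical_columns, numerical_columns)
```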

In [21]:
# Lower-case column values
sample_df['col_2'].str.lower()

0      a
1      b
2      c
3      d
4      e
5      f
6    NaN
Name: col_2, dtype: object

In [22]:
# Collapse runs of whitespace in a column into a single space
sample_df['col_2'].replace(r'\s+', ' ', regex=True)

0      a
1      b
2      c
3      d
4      e
5      f
6    NaN
Name: col_2, dtype: object
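
The sample column contains no whitespace, so the output above is unchanged. The effect is visible on text that does have extra spaces; the `messy` series below is made up for illustration, and `str.strip` is added to trim the ends.

```python
import pandas as pd

messy = pd.Series(["buy   a  car", "  wedding ", "home\t\trepair"])

# Collapse any run of whitespace to one space, then trim both ends
clean = messy.str.replace(r"\s+", " ", regex=True).str.strip()
print(clean.tolist())
```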

In [23]:
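# Replace text values with their numeric equivalent by using replace.
# The outer keys are column names; sample_df has no 'num_1' or 'num_2'
# column, so this call leaves the data unchanged.
cleanup_nums = {'num_1': {'four': 4, 'two': 2},
                'num_2': {'four': 4, 'six': 6, 'five': 5, 'eight': 8,
                          'two': 2, 'twelve': 12, 'three': 3}}

sample_df.replace(cleanup_nums)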

|   | col_1 | col_2 |
|---|---|---|
| 0 | NaN | a |
| 1 | 3.0 | b |
| 2 | 2.0 | c |
| 3 | 1.0 | d |
| 4 | 0.0 | e |
| 5 | NaN | f |
| 6 | 4.0 | NaN |
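
The nested-dictionary form of `replace` only touches columns whose names match the outer keys. A sketch with a made-up frame that actually has `num_1` and `num_2` columns (`toy_df`, `converted_df` and the values are hypothetical):

```python
import pandas as pd

# dtype=object lets text and numbers share a column after the replacement
toy_df = pd.DataFrame(
    {
        "num_1": ["four", "two", "four"],
        "num_2": ["six", "twelve", "three"],
    },
    dtype=object,
)

# Outer keys are column names, inner dicts map old values to new ones
cleanup_nums = {
    "num_1": {"four": 4, "two": 2},
    "num_2": {"six": 6, "twelve": 12, "three": 3},
}
converted_df = toy_df.replace(cleanup_nums)
print(converted_df)
```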


In [24]:
# Concatenate two DataFrame columns into a new, single column
sample_df['col_3'] = (
    sample_df['col_1'].map(str) + ' ' + sample_df['col_2'].map(str)
)
sample_df

|   | col_1 | col_2 | col_3 |
|---|---|---|---|
| 0 | NaN | a | nan a |
| 1 | 3.0 | b | 3.0 b |
| 2 | 2.0 | c | 2.0 c |
| 3 | 1.0 | d | 1.0 d |
| 4 | 0.0 | e | 0.0 e |
| 5 | NaN | f | nan f |
| 6 | 4.0 | NaN | 4.0 nan |
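
The `map(str)` approach turns missing values into the literal text `nan`, as the `col_3` values show. If that is unwanted, one possible variant is to convert missing values to empty strings first. This is only a sketch on a cut-down frame; `toy_df`, `left` and `right` are hypothetical names.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "col_1": [np.nan, 3, 4],
    "col_2": ["a", "b", np.nan],
})

# Turn missing values into empty strings before joining
left = toy_df["col_1"].map(lambda v: "" if pd.isna(v) else str(v))
right = toy_df["col_2"].fillna("")

# strip() removes the separator left over when one side is empty
toy_df["col_3"] = (left + " " + right).str.strip()
print(toy_df["col_3"].tolist())
```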


In [25]:
# Split a DataFrame column into two columns on a separator
sample_df[['col_3_1', 'col_3_2']] = (
    sample_df['col_3'].str.split(' ', expand=True)
)
sample_df

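|   | col_1 | col_2 | col_3 | col_3_1 | col_3_2 |
|---|---|---|---|---|---|
| 0 | NaN | a | nan a | nan | a |
| 1 | 3.0 | b | 3.0 b | 3.0 | b |
| 2 | 2.0 | c | 2.0 c | 2.0 | c |
| 3 | 1.0 | d | 1.0 d | 1.0 | d |
| 4 | 0.0 | e | 0.0 e | 0.0 | e |
| 5 | NaN | f | nan f | nan | f |
| 6 | 4.0 | NaN | 4.0 nan | 4.0 | nan |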


In [26]:
# Sort dataframe by column
# NaN values will be first
sample_df.sort_values('col_1', ascending=False, na_position='first')

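|   | col_1 | col_2 | col_3 | col_3_1 | col_3_2 |
|---|---|---|---|---|---|
| 0 | NaN | a | nan a | nan | a |
| 5 | NaN | f | nan f | nan | f |
| 6 | 4.0 | NaN | 4.0 nan | 4.0 | nan |
| 1 | 3.0 | b | 3.0 b | 3.0 | b |
| 2 | 2.0 | c | 2.0 c | 2.0 | c |
| 3 | 1.0 | d | 1.0 d | 1.0 | d |
| 4 | 0.0 | e | 0.0 e | 0.0 | e |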


In [27]:
# Sort dataframe by multiple columns
sample_df.sort_values(['col_1', 'col_2'], ascending=[True, False])

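|   | col_1 | col_2 | col_3 | col_3_1 | col_3_2 |
|---|---|---|---|---|---|
| 4 | 0.0 | e | 0.0 e | 0.0 | e |
| 3 | 1.0 | d | 1.0 d | 1.0 | d |
| 2 | 2.0 | c | 2.0 c | 2.0 | c |
| 1 | 3.0 | b | 3.0 b | 3.0 | b |
| 6 | 4.0 | NaN | 4.0 nan | 4.0 | nan |
| 5 | NaN | f | nan f | nan | f |
| 0 | NaN | a | nan a | nan | a |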


In [28]:
# Grab DataFrame rows where column has certain values
values = ['a', 'b', 'c']
sample_df[sample_df['col_2'].isin(values)]

|   | col_1 | col_2 | col_3 | col_3_1 | col_3_2 |
|---|---|---|---|---|---|
| 0 | NaN | a | nan a | nan | a |
| 1 | 3.0 | b | 3.0 b | 3.0 | b |
| 2 | 2.0 | c | 2.0 c | 2.0 | c |


In [29]:
# Grab DataFrame rows where column doesn't have certain values
values = ['a', 'b', 'c']
sample_df[~sample_df['col_2'].isin(values)]

|   | col_1 | col_2 | col_3 | col_3_1 | col_3_2 |
|---|---|---|---|---|---|
| 3 | 1.0 | d | 1.0 d | 1.0 | d |
| 4 | 0.0 | e | 0.0 e | 0.0 | e |
| 5 | NaN | f | nan f | nan | f |
| 6 | 4.0 | NaN | 4.0 nan | 4.0 | nan |


In [30]:
# Select from DataFrame using conditions for multiple columns
# (`|` is for OR; use `&` instead of `|` for AND)
sample_df[(sample_df['col_1'] >= 2) | (sample_df['col_2'] == 'c')]

|   | col_1 | col_2 | col_3 | col_3_1 | col_3_2 |
|---|---|---|---|---|---|
| 1 | 3.0 | b | 3.0 b | 3.0 | b |
| 2 | 2.0 | c | 2.0 c | 2.0 | c |
| 6 | 4.0 | NaN | 4.0 nan | 4.0 | nan |
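
The same OR condition can also be written as a query string with `DataFrame.query`, which some readers find easier to scan than nested brackets. A sketch on the original two columns of the sample data (`selected` is a hypothetical name):

```python
import numpy as np
import pandas as pd

sample_df = pd.DataFrame({
    "col_1": [np.nan, 3, 2, 1, 0, np.nan, 4],
    "col_2": ["a", "b", "c", "d", "e", "f", np.nan],
})

# Same OR filter written as a query string; use `and` for AND
selected = sample_df.query("col_1 >= 2 or col_2 == 'c'")
print(selected.index.tolist())
```

Rows where `col_1` is `NaN` fail the `>=` comparison, so they are only kept if the `col_2` condition matches.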


## Wrapping long lines
According to [PEP 8](https://pythonworld.ru/osnovy/pep-8-rukovodstvo-po-napisaniyu-koda-na-python.html), a line counts as long once it exceeds **79 characters**.

**[More on line breaks in Python](https://tirinox.ru/new-line/)**

In [31]:
print('This is a really long sentence, '  # adjacent literals are joined
      'but we can split it across multiple lines.')

This is a really long sentence, but we can split it across multiple lines.


In [32]:
# backslash (explicit line joining)
a = '1' + '2' + '3' + \
    '4' + '5'

In [33]:
# using parentheses (implicit line joining)
a = ('1' + '2' + '3' +
     '4' + '5')

In [34]:
# using parentheses (implicit line joining)
a = True
b = False

if (a
        and not b):
    print('Do something')

Do something


In [35]:
# backslash (explicit line joining)
if a and \
        not b:
    print('Do something')

Do something
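
Implicit line joining inside parentheses is also a common way to keep pandas method chains within the line limit. A sketch, reusing the sample data from the examples section (`result` is a hypothetical name):

```python
import numpy as np
import pandas as pd

sample_df = pd.DataFrame({
    "col_1": [np.nan, 3, 2, 1, 0, np.nan, 4],
    "col_2": ["a", "b", "c", "d", "e", "f", np.nan],
})

# Parentheses allow one method call per line without backslashes
result = (
    sample_df
    .dropna(subset=["col_1"])
    .sort_values("col_1", ascending=False)
    .reset_index(drop=True)
)
print(result)
```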


## Useful links <a name="additional_info"></a>
- [Ways to display plots](https://pyprog.pro/mpl/mpl_magic_teams.html)
- [How to use Jupyter Notebook at 100% (video)](https://www.youtube.com/watch?v=q4d-hKCpTEc)
- [Cheat Sheet Jupyter Notebook](https://www.edureka.co/blog/wp-content/uploads/2018/10/Jupyter_Notebook_CheatSheet_Edureka.pdf)
- [Cheat Sheet Python](https://perso.limsi.fr/pointal/_media/python:cours:mementopython3-english.pdf)
- [Cheat Sheet Pandas](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
- [Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/)
- [Guide to writing Python code (PEP 8, in Russian)](https://pythonworld.ru/osnovy/pep-8-rukovodstvo-po-napisaniyu-koda-na-python.html)
- <code style="background:yellow;color:black">[The Ultimate Markdown Guide for Jupyter Notebook](https://medium.com/analytics-vidhya/the-ultimate-markdown-guide-for-jupyter-notebook-d5e5abf728fd)</code>