# Pandas: basic operations and indexing

In [1]:
import pandas as pd

Load the test dataset.

In [2]:
df = pd.read_csv('./data/StudentsPerformance.csv')
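
If the CSV file is not at hand, the same call can be tried on an in-memory string. This is a minimal sketch: `csv_text` and `sample` are illustrative names, and the three rows are copied from the output shown in this notebook, with only some of the columns kept.

```python
import io

import pandas as pd

# A few rows copied from this notebook's output, used in place of the real
# file so the example runs anywhere. The column subset is an assumption.
csv_text = """gender,race/ethnicity,math score,reading score,writing score
female,group B,72,72,74
female,group C,69,90,88
male,group A,47,57,44
"""

# read_csv accepts any file-like object, not only a path.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)  # (3, 5)
```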

Display the first 5 rows.

In [3]:
df.head(5)

,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75


Display descriptive statistics for the numeric columns.

In [4]:
df.describe()

,math score,reading score,writing score
count,1000.0,1000.0,1000.0
mean,66.089,69.169,68.054
std,15.16308,14.600192,15.195657
min,0.0,17.0,10.0
25%,57.0,59.0,57.75
50%,66.0,70.0,69.0
75%,77.0,79.0,79.0
max,100.0,100.0,100.0
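
Each row of the `describe()` output can also be obtained from a single column method. The sketch below checks this on the five rows shown by `df.head(5)`; `sample`, `summary`, and `manual` are illustrative names, not part of the notebook.

```python
import pandas as pd

# The five rows shown by df.head(5), reduced to two numeric columns.
sample = pd.DataFrame({
    'math score': [72, 69, 90, 47, 76],
    'reading score': [72, 90, 95, 57, 78],
})

summary = sample.describe()
math = sample['math score']

# Each row label of describe() paired with the equivalent column method.
manual = {
    'count': math.count(),
    'mean': math.mean(),
    'std': math.std(),            # sample standard deviation (ddof=1)
    'min': math.min(),
    '25%': math.quantile(0.25),
    '50%': math.median(),
    '75%': math.quantile(0.75),
    'max': math.max(),
}

for name, value in manual.items():
    print(name, summary.loc[name, 'math score'], value)
```

Note that `std` is the sample standard deviation, so it divides by n - 1 rather than n.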


Display the data type of each column.

In [5]:
df.dtypes

gender                         object
race/ethnicity                 object
parental level of education    object
lunch                          object
test preparation course        object
math score                      int64
reading score                   int64
writing score                   int64
dtype: object
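
The dtypes can also be used to pick columns. As a small sketch on made-up variable names (`sample`, `numeric`), `select_dtypes` keeps only the numeric columns, which are the ones `describe()` summarizes by default:

```python
import pandas as pd

# Three rows copied from this notebook's output; column subset chosen for brevity.
sample = pd.DataFrame({
    'gender': ['female', 'female', 'male'],
    'math score': [72, 69, 47],
    'writing score': [74, 88, 44],
})

# Keep only columns with a numeric dtype (int64 here).
numeric = sample.select_dtypes(include='number')
print(list(numeric.columns))  # ['math score', 'writing score']
```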

Display the dimensions of the data as (rows, columns).

In [6]:
df.shape

(1000, 8)

Group the data by gender and compute the mean of the writing score variable.

In [7]:
df.groupby('gender').aggregate({'writing score': 'mean'})

gender,writing score
female,72.467181
male,63.311203
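
The same aggregation can be written in a few equivalent ways that differ only in the shape of the result. The sketch below uses the five rows from `df.head(5)`, so the means differ from the full-dataset values above; the variable names are illustrative.

```python
import pandas as pd

# The five rows shown by df.head(5), reduced to the columns needed here.
sample = pd.DataFrame({
    'gender': ['female', 'female', 'female', 'male', 'male'],
    'writing score': [74, 88, 93, 44, 75],
})

# Dict form, as in the cell above: a DataFrame indexed by gender.
by_dict = sample.groupby('gender').aggregate({'writing score': 'mean'})

# Column selection followed by mean(): a Series indexed by gender.
by_column = sample.groupby('gender')['writing score'].mean()

# as_index=False keeps gender as an ordinary column instead of the index.
flat = sample.groupby('gender', as_index=False)['writing score'].mean()

print(by_column['female'], by_column['male'])  # 85.0 59.5
```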


Select the first 5 observations and the first 3 columns using iloc (integer location), which indexes by position.

In [8]:
df.iloc[0:5, 0:3]

,gender,race/ethnicity,parental level of education
0,female,group B,bachelor's degree
1,female,group C,some college
2,female,group B,master's degree
3,male,group A,associate's degree
4,male,group C,some college
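
One detail worth checking when slicing: `iloc` excludes the end position, like ordinary Python slices, while `loc` includes the end label. A minimal sketch on the first five rows (`sample` is an illustrative name):

```python
import pandas as pd

# The five rows shown by df.head(5), with the default integer index 0..4.
sample = pd.DataFrame({
    'gender': ['female', 'female', 'female', 'male', 'male'],
    'math score': [72, 69, 90, 47, 76],
})

by_position = sample.iloc[0:3]  # positions 0, 1, 2 (end excluded)
by_label = sample.loc[0:3]      # labels 0, 1, 2, 3 (end included)

print(len(by_position), len(by_label))  # 3 4
```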


Select specific rows and columns by listing their integer positions; -1 refers to the last column.

In [9]:
df.iloc[[0, 3, 10], [0, 5, -1]]

,gender,math score,writing score
0,female,72,74
3,male,47,44
10,male,58,52
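
The positional selection above has a label-based counterpart. As a sketch on a reduced copy of the first five rows (the three-column `sample` frame is an assumption made for brevity), the same cells can be requested by column name with `loc`:

```python
import pandas as pd

# The five rows shown by df.head(5), reduced to three columns.
sample = pd.DataFrame({
    'gender': ['female', 'female', 'female', 'male', 'male'],
    'math score': [72, 69, 90, 47, 76],
    'writing score': [74, 88, 93, 44, 75],
})

# Positions: first and last column of rows 0 and 3.
by_position = sample.iloc[[0, 3], [0, -1]]

# Labels: the same cells, named explicitly.
by_label = sample.loc[[0, 3], ['gender', 'writing score']]

print(by_position.equals(by_label))  # True
```

Selecting by name is usually more robust when the column order may change.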


Assign more meaningful index labels to the rows.

In [10]:
df_with_names = df.iloc[[0, 3, 4, 6]]

df_with_names.index = ['Bob', 'Greg', 'Jon', 'Leo']
df_with_names

,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
Bob,female,group B,bachelor's degree,standard,none,72,72,74
Greg,male,group A,associate's degree,free/reduced,none,47,57,44
Jon,male,group C,some college,standard,none,76,78,75
Leo,female,group B,some college,standard,completed,88,95,92


Select rows by index label using loc.

In [11]:
df_with_names.loc[['Greg', 'Jon']]

,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
Greg,male,group A,associate's degree,free/reduced,none,47,57,44
Jon,male,group C,some college,standard,none,76,78,75
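
Besides a list of labels, `loc` also accepts a single row and column label, or a boolean mask. The sketch below rebuilds a reduced version of the named frame from the values shown above; `named` and `high_math` are illustrative names.

```python
import pandas as pd

# The four named rows from the output above, reduced to three columns.
named = pd.DataFrame(
    {
        'gender': ['female', 'male', 'male', 'female'],
        'math score': [72, 47, 76, 88],
        'writing score': [74, 44, 75, 92],
    },
    index=['Bob', 'Greg', 'Jon', 'Leo'],
)

# A single cell: row label first, then column label.
greg_math = named.loc['Greg', 'math score']

# A boolean mask selects rows; the list selects columns.
high_math = named.loc[named['math score'] > 75, ['gender', 'math score']]

print(greg_math)              # 47
print(list(high_math.index))  # ['Jon', 'Leo']
```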


Select a single column from the DataFrame; the result is a Series.

In [12]:
type(df_with_names.iloc[:, 0])

pandas.core.series.Series
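
Whether a selection returns a Series or a DataFrame depends on whether a scalar or a list is passed. A short sketch, again on a reduced and illustrative `named` frame:

```python
import pandas as pd

# Two of the named rows from above, reduced to two columns.
named = pd.DataFrame(
    {'gender': ['female', 'male'], 'math score': [72, 47]},
    index=['Bob', 'Greg'],
)

col_series = named.iloc[:, 0]    # scalar column position: Series
col_frame = named.iloc[:, [0]]   # list of positions: one-column DataFrame
row_series = named.iloc[0]       # a single row is also a Series

print(type(col_series).__name__)  # Series
print(type(col_frame).__name__)   # DataFrame
print(row_series.name)            # Bob
```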

Create two pandas Series that share the same index.

In [13]:
my_series_1 = pd.Series([1, 2, 3], index=['Bob', 'Greg', 'Jon'])
my_series_2 = pd.Series([4, 5, 6], index=['Bob', 'Greg', 'Jon'])

Create a DataFrame from the two Series.

In [14]:
pd.DataFrame({'col_name_1': my_series_1, 'col_name_2': my_series_2})

,col_name_1,col_name_2
Bob,1,4
Greg,2,5
Jon,3,6
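
In the cell above both Series share the same index, so every cell is filled. When the indexes differ, pandas aligns the values by label and fills the gaps with NaN. A sketch with a hypothetical second Series `other` whose labels only partly overlap:

```python
import pandas as pd

my_series_1 = pd.Series([1, 2, 3], index=['Bob', 'Greg', 'Jon'])
# Hypothetical Series with a different index, for illustration only.
other = pd.Series([4, 5], index=['Greg', 'Leo'])

combined = pd.DataFrame({'col_name_1': my_series_1, 'col_name_2': other})

# The result has one row per label found in either Series.
print(combined.shape)                             # (4, 2)
print(combined.loc['Greg', 'col_name_2'])         # 4.0
print(pd.isna(combined.loc['Bob', 'col_name_2'])) # True
```

Because NaN is a floating-point value, columns with gaps are stored as float64 rather than int64.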
