# Basics of Data

Data sets are made up of data objects. A data object represents an entity.
Examples: 
- sales database:  customers, store items, sales
- medical database: patients, treatments
- university database: students, professors, courses

In other literature, data objects are also called samples, examples, instances, data points, objects, or tuples.
Data objects are described by attributes.
Hence, in a database table:
- rows correspond to data objects;
- columns correspond to attributes.

|  | Attribute 01 | Attribute 02 | Attribute 03 |
|--|--------------|--------------|--------------|
|Object 1|     x11   | x12 |x13 |
|Object 2|     x21   | x22 | x23 |
|Object 3|     x31   | x32 | x33 |

For example: 

| Index  | Customer_ID | Name | Age |
|--|--------------|--------------|--------------|
| 1 |     C_01   | Anna |33 |
| 2 |     C_02   | Bob | 27 |
| 3 |     C_03   | Charlie | 45 |
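
The customer table above can be sketched as a pandas DataFrame, where each column is an attribute and each row is a data object. This is only an illustration; the variable name `customers` is an assumption, not part of the course material.

```python
import pandas as pd

# Each dictionary key is an attribute (column); each list position is a data object (row).
customers = pd.DataFrame(
    {
        "Customer_ID": ["C_01", "C_02", "C_03"],
        "Name": ["Anna", "Bob", "Charlie"],
        "Age": [33, 27, 45],
    },
    index=[1, 2, 3],  # the "Index" column of the table above
)

n_objects, n_attributes = customers.shape
print(n_objects, n_attributes)        # 3 3
print(customers.loc[2, "Name"])       # Bob
```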

Attribute (also called dimension, feature, or variable): a data field representing a characteristic or feature of a data object.
E.g., customer ID, name, address.
Types:
- Categorical 
  - Ordinal
  - Nominal
  - Binary
- Continuous (Numerical)
  - Interval-scaled
  - Ratio-scaled

## Categorical attributes (variables):
- Take on a small set of possible values
- Typically qualitative (sometimes called “discrete” or “qualitative” variables)
- Information that can be sorted into categories
- Types of categorical attributes:
  - ordinal
  - nominal
  - dichotomous (binary)


### Ordinal — a categorical attribute with some intrinsic order or numeric value

|  Attribute Name    | Attribute Values     |
|------|------|
| Agreement     | Strongly disagree, disagree, neutral, agree, strongly agree       |
| Rating | Excellent, Good, Fair, Poor |
| Frequency | Always, Often, Sometimes, Never |
| Any other scale | On a scale of 1 to 5|
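
One possible way to represent an ordinal attribute in pandas is an ordered categorical. The sketch below uses made-up survey answers and takes the category order from the Rating row above; the names `levels` and `ratings` are assumptions for illustration.

```python
import pandas as pd

# Category order from lowest to highest.
levels = ["Poor", "Fair", "Good", "Excellent"]

# Hypothetical answers, stored as an ordered categorical.
ratings = pd.Series(
    pd.Categorical(
        ["Good", "Poor", "Excellent", "Fair", "Good"],
        categories=levels,
        ordered=True,
    )
)

# Because the categories are ordered, comparisons and min/max are meaningful.
at_least_good = ratings >= "Good"
print(at_least_good.tolist())          # [True, False, True, False, True]
print(ratings.min(), ratings.max())    # Poor Excellent
```

For a nominal attribute the same construction would use `ordered=False`, and comparisons such as `>=` would not be allowed.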

### Nominal – a categorical attribute without an intrinsic order

| Attribute Name | Attribute Values |
|------ | ------|
| City in Korea | Seoul, Busan, Daegu |
| Gender | Male, Female |
| Nationality | American, Mexican, French, Korean |
| Favorite Pet | Dog, Cat, Fish, Snake |
| Blood Type | A, B, O, AB |

### Dichotomous (or binary) attributes – a categorical attribute with only 2 levels of categories
- Often represents the answer to a yes or no question

- **Symmetric binary**: both outcomes equally important
  - e.g., gender
- **Asymmetric binary**: outcomes not equally important.  
  - e.g., medical test (positive vs. negative)
  - Convention: assign 1 to most important outcome (e.g., HIV positive)
  
| Attribute Name | Attribute Values |
| ----- | -----|
| Gender | Male, Female |
| Attend Party | Yes, No |
| Join Conference | Yes, No |
| Medical Test | Positive, Negative |
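
The 1/0 convention for an asymmetric binary attribute can be sketched as follows. The test results are made up, and the names `results` and `encoded` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical test results for five patients.
results = pd.Series(["Negative", "Positive", "Negative", "Negative", "Positive"])

# Asymmetric binary: code the more important outcome as 1.
encoded = results.map({"Positive": 1, "Negative": 0})

print(encoded.tolist())    # [0, 1, 0, 0, 1]
print(encoded.sum())       # 2 (number of positive results)
```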

## Continuous (numerical) attributes:
- Always numeric or quantitative
- Can be any number, positive or negative
- Examples: age in years, weight, blood pressure readings, temperature, concentrations of pollutants, and other measurements
- Types of continuous attributes:
  - interval
  - ratio

### Interval - Continuous Attributes

Interval data is like ordinal data, except that the intervals between consecutive values are equal.
- Measured on a scale of equal-sized units
- Values have order
  - E.g., temperature in °C or °F, calendar dates
- No true zero point
- The difference between 20 and 21 degrees Celsius is the same as the difference between 34 and 35 degrees Celsius.


### Ratio - Continuous Attributes

Ratio data is interval data with a natural (inherent) zero point.
- We can speak of one value as being a multiple of another (10 K is twice as high as 5 K).
  - e.g., temperature in Kelvin, length, counts, monetary quantities

| Variable Name | Zero Point |
|------ | ----- |
| Temperature in Kelvin | 0 K (absolute zero) |
| Age | Starts at 0 |
| Length | Starts at 0 |
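
The difference between interval and ratio scales can be checked with a little arithmetic. The sketch below converts two Kelvin temperatures to Celsius (C = K - 273.15); the variable names are chosen only for illustration.

```python
# Two temperatures on a ratio scale (Kelvin).
kelvin_a, kelvin_b = 5.0, 10.0

# The same temperatures on an interval scale (Celsius).
celsius_a = kelvin_a - 273.15
celsius_b = kelvin_b - 273.15

# Differences agree on both scales ...
print(kelvin_b - kelvin_a)                  # 5.0
print(round(celsius_b - celsius_a, 2))      # 5.0

# ... but "twice as high" only holds on the scale with a true zero.
print(kelvin_b / kelvin_a)                  # 2.0
print(round(celsius_b / celsius_a, 3))      # 0.981
```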

<p style="font-size:30px">Learning Check</p>

- How many attribute types are there in a data set?
- Can you tell the difference between categorical and continuous attributes?
- What is the difference among ordinal, nominal, and binary attributes?
- What is the difference between interval and ratio attributes?


# Pandas Library 

The pandas library is open source and has a solid community backing it. It is used in data analysis and data science as a data manipulation tool. In this section, we will learn how to compute basic exploratory statistics with the pandas library.

First, we need to import pandas.


In [84]:
import numpy as np
import pandas as pd


The pandas library provides many functions. To trigger autocompletion in Jupyter Notebook, press Tab after typing ```pd.```

Data sources vary by domain. For example, data can be extracted from a database or data warehouse using Structured Query Language (SQL). In other domains, data comes from semi-structured documents such as JSON files. However, many analyses work with tabular files such as CSV (comma-separated values), Excel, TSV (tab-separated values), and plain text files.

Some pandas functions that read data into a tabular format are as follows:

- ```read_json``` reads a JSON file.

- ```read_sql_query``` runs a SQL query and reads the result from a database.

- ```read_csv``` reads a CSV file. CSV (comma-separated values) is a common format for storing data because it is compact and simple to read and write. Variants use other separators, such as a semicolon (;), a space (" "), or a tab ("\t").

Note that pandas can also write data to most of these file formats. The readers and writers are listed below.


| Format Type | Data Description | Reader | Writer |
|--------- | -------- | ------- | ------ |
| text | CSV | read_csv | to_csv |
| text | Fixed-Width Text File | read_fwf | |
| text | JSON | read_json | to_json |
| text | HTML | read_html | to_html |
| text | Local clipboard | read_clipboard | to_clipboard |
| binary | MS Excel | read_excel | to_excel |
| binary | OpenDocument | read_excel | |
| binary | HDF5 Format | read_hdf | to_hdf |
| binary | Feather Format | read_feather | to_feather |
| binary | Parquet Format | read_parquet | to_parquet |
| binary | ORC Format | read_orc | |
| binary | Msgpack (removed in pandas 1.0) | read_msgpack | to_msgpack |
| binary | Stata | read_stata | to_stata |
| binary | SAS | read_sas | |
| binary | SPSS | read_spss | |
| binary | Python Pickle Format | read_pickle | to_pickle |
| SQL | SQL | read_sql | to_sql |
| SQL | Google BigQuery | read_gbq | to_gbq |


For this course, we will focus mostly on CSV files.
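
As a small illustration of a reader/writer pair, the sketch below writes a tiny made-up table to CSV text and reads it back. It uses an in-memory `io.StringIO` buffer instead of a file on disk so that it runs anywhere; the names `df`, `csv_text`, and `restored` are assumptions.

```python
import io
import pandas as pd

# A tiny made-up table, only for illustration.
df = pd.DataFrame({"carat": [0.23, 0.21], "cut": ["Ideal", "Premium"]})

# to_csv with no path returns the CSV text; index=False leaves out the row index.
csv_text = df.to_csv(index=False)
print(csv_text)

# read_csv accepts a file path or a file-like object such as StringIO.
restored = pd.read_csv(io.StringIO(csv_text))
print(restored.equals(df))    # True
```

With a real file, the same round trip would pass a path to both calls, for example `df.to_csv('some_file.csv', index=False)` (the file name here is hypothetical).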

In [85]:
# you can omit the 'sep' parameter if the dataset uses a comma as the separator
data = pd.read_csv('data/diamonds.csv', sep=",")

If the separator is not a comma, specify it with the ```sep``` parameter. The default is a comma (","), so for comma-separated files you can pass only the file name.
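
The effect of ```sep``` can be seen in a minimal sketch with made-up file contents held in memory. The strings and variable names below are assumptions for illustration, not files from the course.

```python
import io
import pandas as pd

# Made-up contents that use a semicolon and a tab as separators.
semicolon_text = "carat;cut;price\n0.23;Ideal;326\n0.21;Premium;326\n"
tab_text = "carat\tcut\tprice\n0.23\tIdeal\t326\n0.21\tPremium\t326\n"

df_semicolon = pd.read_csv(io.StringIO(semicolon_text), sep=";")
df_tab = pd.read_csv(io.StringIO(tab_text), sep="\t")

# Without sep=";", each line is read as one single column.
df_wrong = pd.read_csv(io.StringIO(semicolon_text))

print(df_semicolon.shape)    # (2, 3)
print(df_tab.shape)          # (2, 3)
print(df_wrong.shape)        # (2, 1)
```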

In [86]:
#Basic info of the data
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   carat    53940 non-null  float64
 1   cut      53940 non-null  object 
 2   color    53940 non-null  object 
 3   clarity  53940 non-null  object 
 4   depth    53940 non-null  float64
 5   table    53940 non-null  float64
 6   price    53940 non-null  int64  
 7   x        53940 non-null  float64
 8   y        53940 non-null  float64
 9   z        53940 non-null  float64
dtypes: float64(6), int64(1), object(3)
memory usage: 4.1+ MB


Data Dictionary.

Many publicly available datasets provide a data dictionary: an explanation of the dataset's attributes, their data types, and their definitions.

| Attributes |	Type	| Definition |
| ------- | ------  | ----------- |
| carat |	numerical	| Weight of diamond in carats (1 carat = 200mg) |
| cut	| categorical |	Quality of the cut (Fair, Good, Very Good, Premium, Ideal) |
| color |	categorical |	GIA color scale, standardized for grading |
| clarity |	categorical |	GIA visibility scale, number and size of inclusions |
| depth |	numerical |	see figure |
| table |	numerical|	see figure |
| price |	numerical |	The diamond price |
| x |	numerical |	see figure |
| y |	numerical|	see figure |
| z |	numerical |	see figure |

<img src="figures/[03-02]_Figure_01_diamonds.jpg">

In [87]:
# head: show the first 5 rows
data.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


In [88]:
# tail: show the last 5 rows
data.tail()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
53935,0.72,Ideal,D,SI1,60.8,57.0,2757,5.75,5.76,3.5
53936,0.72,Good,D,SI1,63.1,55.0,2757,5.69,5.75,3.61
53937,0.7,Very Good,D,SI1,62.8,60.0,2757,5.66,5.68,3.56
53938,0.86,Premium,H,SI2,61.0,58.0,2757,6.15,6.12,3.74
53939,0.75,Ideal,D,SI2,62.2,55.0,2757,5.83,5.87,3.64


There are many other parameters in the ```read_csv``` function. For example, we can choose the index column as follows:

```df = pd.read_csv('data/diamonds.csv', index_col=0)```

To use a column as the index, pass ```index_col``` with the position (or the name) of that column.
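
A minimal sketch of ```index_col```, using made-up CSV text in memory rather than the diamonds file. The column can be given by position or by name; the variable names are assumptions.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file on disk.
csv_text = "carat,cut,price\n0.23,Ideal,326\n0.21,Premium,326\n"

by_position = pd.read_csv(io.StringIO(csv_text), index_col=1)     # second column
by_name = pd.read_csv(io.StringIO(csv_text), index_col="cut")     # same column, by name

print(by_position.index.tolist())      # ['Ideal', 'Premium']
print(by_position.columns.tolist())    # ['carat', 'price']
print(by_position.equals(by_name))     # True
```

Note that the column used as the index no longer appears among the regular columns.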

In [89]:
data2 = pd.read_csv('data/diamonds.csv', index_col=1)

In [90]:
data2.head()

Unnamed: 0_level_0,carat,color,clarity,depth,table,price,x,y,z
cut,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Ideal,0.23,E,SI2,61.5,55.0,326,3.95,3.98,2.43
Premium,0.21,E,SI1,59.8,61.0,326,3.89,3.84,2.31
Good,0.23,E,VS1,56.9,65.0,327,4.05,4.07,2.31
Premium,0.29,I,VS2,62.4,58.0,334,4.2,4.23,2.63
Good,0.31,J,SI2,63.3,58.0,335,4.34,4.35,2.75


In [91]:
data2

Unnamed: 0_level_0,carat,color,clarity,depth,table,price,x,y,z
cut,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Ideal,0.23,E,SI2,61.5,55.0,326,3.95,3.98,2.43
Premium,0.21,E,SI1,59.8,61.0,326,3.89,3.84,2.31
Good,0.23,E,VS1,56.9,65.0,327,4.05,4.07,2.31
Premium,0.29,I,VS2,62.4,58.0,334,4.20,4.23,2.63
Good,0.31,J,SI2,63.3,58.0,335,4.34,4.35,2.75
...,...,...,...,...,...,...,...,...,...
Ideal,0.72,D,SI1,60.8,57.0,2757,5.75,5.76,3.50
Good,0.72,D,SI1,63.1,55.0,2757,5.69,5.75,3.61
Very Good,0.70,D,SI1,62.8,60.0,2757,5.66,5.68,3.56
Premium,0.86,H,SI2,61.0,58.0,2757,6.15,6.12,3.74


In [92]:
#it shows the values in an ndarray format
data.values

array([[0.23, 'Ideal', 'E', ..., 3.95, 3.98, 2.43],
       [0.21, 'Premium', 'E', ..., 3.89, 3.84, 2.31],
       [0.23, 'Good', 'E', ..., 4.05, 4.07, 2.31],
       ...,
       [0.7, 'Very Good', 'D', ..., 5.66, 5.68, 3.56],
       [0.86, 'Premium', 'H', ..., 6.15, 6.12, 3.74],
       [0.75, 'Ideal', 'D', ..., 5.83, 5.87, 3.64]], dtype=object)

In [93]:
#it shows the index values
data.index

RangeIndex(start=0, stop=53940, step=1)

In [94]:
data.columns

Index(['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'price', 'x', 'y',
       'z'],
      dtype='object')

In [95]:
#it shows the data shape (# of rows, # of columns)
data.shape

(53940, 10)

In [96]:
#it shows the summary of the data (only numerical attributes)
data.describe()

Unnamed: 0,carat,depth,table,price,x,y,z
count,53940.0,53940.0,53940.0,53940.0,53940.0,53940.0,53940.0
mean,0.79794,61.749405,57.457184,3932.799722,5.731157,5.734526,3.538734
std,0.474011,1.432621,2.234491,3989.439738,1.121761,1.142135,0.705699
min,0.2,43.0,43.0,326.0,0.0,0.0,0.0
25%,0.4,61.0,56.0,950.0,4.71,4.72,2.91
50%,0.7,61.8,57.0,2401.0,5.7,5.71,3.53
75%,1.04,62.5,59.0,5324.25,6.54,6.54,4.04
max,5.01,79.0,95.0,18823.0,10.74,58.9,31.8


In [97]:
# both categorical and continuous (numerical) data are shown
data.describe(include = 'all')

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
count,53940.0,53940,53940,53940,53940.0,53940.0,53940.0,53940.0,53940.0,53940.0
unique,,5,7,8,,,,,,
top,,Ideal,G,SI1,,,,,,
freq,,21551,11292,13065,,,,,,
mean,0.79794,,,,61.749405,57.457184,3932.799722,5.731157,5.734526,3.538734
std,0.474011,,,,1.432621,2.234491,3989.439738,1.121761,1.142135,0.705699
min,0.2,,,,43.0,43.0,326.0,0.0,0.0,0.0
25%,0.4,,,,61.0,56.0,950.0,4.71,4.72,2.91
50%,0.7,,,,61.8,57.0,2401.0,5.7,5.71,3.53
75%,1.04,,,,62.5,59.0,5324.25,6.54,6.54,4.04


In [98]:
#describe only numerical data
data.describe(include=[np.number])

Unnamed: 0,carat,depth,table,price,x,y,z
count,53940.0,53940.0,53940.0,53940.0,53940.0,53940.0,53940.0
mean,0.79794,61.749405,57.457184,3932.799722,5.731157,5.734526,3.538734
std,0.474011,1.432621,2.234491,3989.439738,1.121761,1.142135,0.705699
min,0.2,43.0,43.0,326.0,0.0,0.0,0.0
25%,0.4,61.0,56.0,950.0,4.71,4.72,2.91
50%,0.7,61.8,57.0,2401.0,5.7,5.71,3.53
75%,1.04,62.5,59.0,5324.25,6.54,6.54,4.04
max,5.01,79.0,95.0,18823.0,10.74,58.9,31.8


In [99]:
#describe only string column data
#(np.object was removed in NumPy 1.24; use the built-in object instead)
data.describe(include=[object])

Unnamed: 0,cut,color,clarity
count,53940,53940,53940
unique,5,7,8
top,Ideal,G,SI1
freq,21551,11292,13065


In [100]:
#describe data excluding numerical data
data.describe(exclude=[np.number])

Unnamed: 0,cut,color,clarity
count,53940,53940,53940
unique,5,7,8
top,Ideal,G,SI1
freq,21551,11292,13065


In [101]:
#describe data excluding string column data
#(np.object was removed in NumPy 1.24; use the built-in object instead)
data.describe(exclude=[object])

Unnamed: 0,carat,depth,table,price,x,y,z
count,53940.0,53940.0,53940.0,53940.0,53940.0,53940.0,53940.0
mean,0.79794,61.749405,57.457184,3932.799722,5.731157,5.734526,3.538734
std,0.474011,1.432621,2.234491,3989.439738,1.121761,1.142135,0.705699
min,0.2,43.0,43.0,326.0,0.0,0.0,0.0
25%,0.4,61.0,56.0,950.0,4.71,4.72,2.91
50%,0.7,61.8,57.0,2401.0,5.7,5.71,3.53
75%,1.04,62.5,59.0,5324.25,6.54,6.54,4.04
max,5.01,79.0,95.0,18823.0,10.74,58.9,31.8
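
The ```include``` and ```exclude``` parameters can be tried on a small made-up table. The sketch below selects by ```np.number``` only, which avoids having to name the string dtype; the data and variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up rows, only for illustration.
df = pd.DataFrame({
    "carat": [0.23, 0.21, 0.29],
    "cut": ["Ideal", "Premium", "Ideal"],
    "price": [326, 326, 334],
})

numeric_summary = df.describe(include=[np.number])   # count, mean, std, min, ...
text_summary = df.describe(exclude=[np.number])      # count, unique, top, freq

print(numeric_summary.columns.tolist())   # ['carat', 'price']
print(text_summary.columns.tolist())      # ['cut']
print(text_summary.loc["top", "cut"])     # Ideal
```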


## Pandas Data Structure

pandas has two main data structures:
- Series (one-dimensional data)
- DataFrame (two-dimensional data)





<img src="figures/[03-02]_Figure_02_Series_DF.jpg">

### Series

A pandas Series is one-dimensional. It is an array that can hold any data type, and it has a labeled axis, referred to as the index. A Series resembles a NumPy array, but each element is addressed by an index label rather than only by its integer position.
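
The following sketch contrasts a NumPy array with a Series built from the same values. The values and labels are made up for illustration.

```python
import numpy as np
import pandas as pd

values = np.array([0.25, 0.5, 0.75])

# A NumPy array is addressed by integer position only.
print(values[0])                 # 0.25

# A Series pairs each value with a label from its index.
s = pd.Series(values, index=["a", "b", "c"], name="carat")
print(s["b"])                    # 0.5
print(s.index.tolist())          # ['a', 'b', 'c']

# Arithmetic between two Series aligns on labels, not on positions.
t = pd.Series([1.0, 2.0, 3.0], index=["c", "b", "a"])
print((s + t).tolist())          # [3.25, 2.5, 1.75]
```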

In [102]:
first = data['carat']

In [103]:
first.head()

0    0.23
1    0.21
2    0.23
3    0.29
4    0.31
Name: carat, dtype: float64

In [104]:
first

0        0.23
1        0.21
2        0.23
3        0.29
4        0.31
         ... 
53935    0.72
53936    0.72
53937    0.70
53938    0.86
53939    0.75
Name: carat, Length: 53940, dtype: float64

In [105]:
# read the data at index 53935
first[53935]

0.72

In [106]:
#get the first 1000 values
take1000 = first[:1000]

In [107]:
take1000.head()

0    0.23
1    0.21
2    0.23
3    0.29
4    0.31
Name: carat, dtype: float64

In [108]:
take1000.shape # shape shows the dimensions; for a Series it is (number of rows,)

(1000,)

In [109]:
#take 500 values, from position 500 to 999
take500 = first[500:1000]

In [110]:
take500.shape

(500,)

### Aggregation functions

Series provides many aggregation functions, but we won't discuss them in detail here.
Assuming there are no ``NaN`` values in the data, the following table lists some useful statistical, cumulative, and element-wise functions.
[ref for Pandas Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html)

| Function Name | Description |
|---------------|-------------|
| ``Series.count`` | Count the number of non-missing elements |
| ``Series.mean`` | Compute the mean of the elements |
| ``Series.sum`` | Compute the sum of the elements |
| ``Series.cumsum`` | Cumulative sum of a Series |
| ``Series.cumprod`` | Cumulative product of a Series |
| ``Series.cummin`` | Cumulative minimum of a Series |
| ``Series.cummax`` | Cumulative maximum of a Series |
| ``Series.nonzero`` | Positions of the non-zero elements (removed in pandas 1.0; use ``Series.to_numpy().nonzero()`` instead) |
| ``Series.ne`` | Element-wise test: not equal to the given value |
| ``Series.eq`` | Element-wise test: equal to the given value |
| ``Series.abs`` | Absolute value of each element |
| ``Series.ge`` | Element-wise test: greater than or equal to the given value |
| ``Series.min`` | Minimum value |
| ``Series.max`` | Maximum value |


We will see these aggregates often throughout the rest of the book.
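
A few of the functions in the table can be tried on a short made-up Series. The sketch groups them by what they return: a single value, or a Series of the same length.

```python
import pandas as pd

s = pd.Series([3, -1, 4, 1, -5])

# Aggregations reduce the Series to a single value.
print(s.count(), s.sum(), s.min(), s.max())   # 5 2 -5 4
print(s.mean())                                # 0.4

# Cumulative functions return a Series of the same length.
print(s.cumsum().tolist())    # [3, 2, 6, 7, 2]
print(s.cummax().tolist())    # [3, 3, 4, 4, 4]

# Element-wise functions also return a Series of the same length.
print(s.abs().tolist())       # [3, 1, 4, 1, 5]
print(s.ge(1).tolist())       # [True, False, True, True, False]
print(s.ne(1).tolist())       # [True, True, True, False, True]
```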

In [111]:
#if it is a Series (a single column), you can use the aggregate function directly
take1000.sum()

689.28

In [112]:
#You can run the aggregate function on a single selected column
data['carat'].mean()

0.7979397478679852

In [113]:
first.mean()

0.7979397478679852

### DataFrame

A pandas DataFrame stores data in a tabular format with integrated indexing. That means each column has a column name and each row has a row index. All of the following examples use pandas DataFrames.

In [114]:
#let first be the data 'carat' and second be the data 'price'
first = data['carat']
second = data['price']

In [115]:
#checking the data
first.head()

0    0.23
1    0.21
2    0.23
3    0.29
4    0.31
Name: carat, dtype: float64

In [116]:
second.head()

0    326
1    326
2    327
3    334
4    335
Name: price, dtype: int64

In [117]:
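#passing "second" as the index does not pair the two columns row by row:
#pandas reindexes "first" by the values of "second", so each row shows
#the carat stored at the label equal to that price (e.g., first[326])
new_data = pd.DataFrame(first, second)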

In [118]:
new_data.head()

Unnamed: 0_level_0,carat
price,Unnamed: 1_level_1
326,0.53
326,0.53
327,0.72
334,0.71
335,0.71
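
The carat values in the output above differ from the first rows of the original data, because the Series is looked up at the labels given as the index. A minimal sketch of this behavior with made-up values follows; the small "prices" are chosen so that they are valid labels of `first`.

```python
import pandas as pd

first = pd.Series([0.23, 0.21, 0.29], name="carat")   # labels 0, 1, 2
second = pd.Series([2, 2, 0], name="price")           # made-up values

# Passing second as the index looks up first at the labels 2, 2, 0.
reindexed = pd.DataFrame(first, index=second)
print(reindexed["carat"].tolist())     # [0.29, 0.29, 0.23]

# To pair the two Series row by row, combine them as columns instead.
paired = pd.concat([first, second], axis=1)
print(paired.columns.tolist())         # ['carat', 'price']
print(paired["price"].tolist())        # [2, 2, 0]
```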


In [119]:
#create a dictionary {key: value} to define attribute name (key) and attribute value (value) 
# of the Data Frame
new_data1 = pd.DataFrame({'carat' : first, 'price': second})

In [120]:
new_data1.head()

Unnamed: 0,carat,price
0,0.23,326
1,0.21,326
2,0.23,327
3,0.29,334
4,0.31,335


In [121]:
new_data2 = pd.DataFrame({'price':second,'carat':first})

In [122]:
new_data2.head()

Unnamed: 0,price,carat
0,326,0.23
1,326,0.21
2,327,0.23
3,334,0.29
4,335,0.31


### How to describe the Categorical Data

```.value_counts()``` returns the frequency of each distinct value in a column.

In [123]:
#revisit the summary of the categorical attributes
#(np.object was removed in NumPy 1.24; use the built-in object instead)
data.describe(include = [object])

Unnamed: 0,cut,color,clarity
count,53940,53940,53940
unique,5,7,8
top,Ideal,G,SI1
freq,21551,11292,13065


In [124]:
#transpose the table so that each attribute becomes a row
data.describe(include = [object]).transpose()

Unnamed: 0,count,unique,top,freq
cut,53940,5,Ideal,21551
color,53940,7,G,11292
clarity,53940,8,SI1,13065


In [125]:
#to check the count of each categorical data
data['cut'].value_counts()

Ideal        21551
Premium      13791
Very Good    12082
Good          4906
Fair          1610
Name: cut, dtype: int64
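
Besides absolute counts, ```value_counts``` can also return relative frequencies through its ```normalize``` parameter. A small sketch with made-up values:

```python
import pandas as pd

# Made-up values, only for illustration.
cut = pd.Series(["Ideal", "Premium", "Ideal", "Good", "Ideal", "Premium"], name="cut")

counts = cut.value_counts()                  # absolute frequencies, most frequent first
shares = cut.value_counts(normalize=True)    # relative frequencies that sum to 1

print(counts.to_dict())     # {'Ideal': 3, 'Premium': 2, 'Good': 1}
print(shares["Ideal"])      # 0.5
print(cut.nunique())        # 3
```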

Later sections will cover other ways to handle categorical attributes.

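### Indexers: loc and iloc

These slicing and indexing conventions can be a source of confusion.
For example, if your ``Series`` has an explicit integer index, an indexing operation such as ``take1000[1]`` will use the explicit index labels, while a slicing operation like ``take1000[1:3]`` will use the implicit Python-style positions. The ``loc`` indexer always refers to labels and the ``iloc`` indexer always refers to integer positions, which removes this ambiguity. The ``ix`` indexer found in older tutorials was removed in pandas 1.0.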
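
The confusion is easiest to see when the integer labels differ from the positions. The sketch below uses a made-up Series with labels 1, 3, 5 and compares plain indexing with the explicit ``loc`` and ``iloc`` indexers.

```python
import pandas as pd

# A Series whose explicit integer labels differ from the positions 0, 1, 2.
s = pd.Series(["a", "b", "c"], index=[1, 3, 5])

print(s[1])                    # label 1 -> a
print(s[1:3].tolist())         # positions 1 and 2 -> ['b', 'c']

# loc always uses labels, and the end label is included.
print(s.loc[1:3].tolist())     # ['a', 'b']

# iloc always uses positions, and the end position is excluded.
print(s.iloc[1:3].tolist())    # ['b', 'c']
```

Writing ``loc`` or ``iloc`` explicitly makes the intent clear to the reader, whatever the index looks like.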

In [126]:
take1000[1:3]

1    0.21
2    0.23
Name: carat, dtype: float64

In [127]:
data[1:3]

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31


In [128]:
take1000.loc[1:3] # returns the rows from label 1 through label 3, inclusive (loc looks up values by label)
# loc[index_label]

1    0.21
2    0.23
3    0.29
Name: carat, dtype: float64

In [129]:
take1000.iloc[1:3] # returns the rows at positions 1 and 2; the end position is excluded (iloc looks up values by integer position)

1    0.21
2    0.23
Name: carat, dtype: float64

In [130]:
#check the current index of the data
data.index

RangeIndex(start=0, stop=53940, step=1)

In [131]:
#set the new index with the value of column price
data.set_index("price", inplace = True)

In [132]:
#check the index
data.index

Int64Index([ 326,  326,  327,  334,  335,  336,  336,  337,  337,  338,
            ...
            2756, 2756, 2757, 2757, 2757, 2757, 2757, 2757, 2757, 2757],
           dtype='int64', name='price', length=53940)

In [133]:
#check the data shape
data.shape

(53940, 9)

In [134]:
data.head()

Unnamed: 0_level_0,carat,cut,color,clarity,depth,table,x,y,z
price,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
326,0.23,Ideal,E,SI2,61.5,55.0,3.95,3.98,2.43
326,0.21,Premium,E,SI1,59.8,61.0,3.89,3.84,2.31
327,0.23,Good,E,VS1,56.9,65.0,4.05,4.07,2.31
334,0.29,Premium,I,VS2,62.4,58.0,4.2,4.23,2.63
335,0.31,Good,J,SI2,63.3,58.0,4.34,4.35,2.75


In [135]:
#this command raises a KeyError because the index (price) has no label 1
data.loc[1:5]

KeyError: 1

In [136]:
#iloc still refers to integer positions (Python-style),
#even though the index is now price
data.iloc[1:5]

Unnamed: 0_level_0,carat,cut,color,clarity,depth,table,x,y,z
price,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
326,0.21,Premium,E,SI1,59.8,61.0,3.89,3.84,2.31
327,0.23,Good,E,VS1,56.9,65.0,4.05,4.07,2.31
334,0.29,Premium,I,VS2,62.4,58.0,4.2,4.23,2.63
335,0.31,Good,J,SI2,63.3,58.0,4.34,4.35,2.75
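
The same contrast can be reproduced on a few made-up rows. The sketch below sets a price column as the index, then looks up rows by label, by position, and by a label that does not exist.

```python
import pandas as pd

# Made-up rows, only for illustration.
df = pd.DataFrame({
    "price": [326, 326, 327, 334],
    "carat": [0.23, 0.21, 0.23, 0.29],
}).set_index("price")

# loc looks up labels: 326 occurs twice, so two values come back.
print(df.loc[326, "carat"].tolist())      # [0.23, 0.21]

# iloc looks up positions and ignores the labels.
print(df.iloc[1:3]["carat"].tolist())     # [0.21, 0.23]

# A label that is not in the index raises KeyError.
try:
    df.loc[1]
    missing_label_raised = False
except KeyError:
    missing_label_raised = True
print(missing_label_raised)               # True
```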


### Rows and Columns Selection

In [137]:
data.loc[[326, 335]]

Unnamed: 0_level_0,carat,cut,color,clarity,depth,table,x,y,z
price,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
326,0.23,Ideal,E,SI2,61.5,55.0,3.95,3.98,2.43
326,0.21,Premium,E,SI1,59.8,61.0,3.89,3.84,2.31
335,0.31,Good,J,SI2,63.3,58.0,4.34,4.35,2.75


In [138]:
#get the rows whose index (price) equals 326 or 335
#and show only the columns carat and cut
data.loc[[326,335], ['carat', 'cut']]

Unnamed: 0_level_0,carat,cut
price,Unnamed: 1_level_1,Unnamed: 2_level_1
326,0.23,Ideal
326,0.21,Premium
335,0.31,Good


In [139]:
#select all rows where cut equals Ideal
data.loc[data['cut'] == "Ideal"]

Unnamed: 0_level_0,carat,cut,color,clarity,depth,table,x,y,z
price,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
326,0.23,Ideal,E,SI2,61.5,55.0,3.95,3.98,2.43
340,0.23,Ideal,J,VS1,62.8,56.0,3.93,3.90,2.46
344,0.31,Ideal,J,SI2,62.2,54.0,4.35,4.37,2.71
348,0.30,Ideal,I,SI2,62.0,54.0,4.31,4.34,2.68
403,0.33,Ideal,I,SI2,61.8,55.0,4.49,4.51,2.78
...,...,...,...,...,...,...,...,...,...
2756,0.79,Ideal,I,SI1,61.6,56.0,5.95,5.97,3.67
2756,0.71,Ideal,E,SI1,61.9,56.0,5.71,5.73,3.54
2756,0.71,Ideal,G,VS1,61.4,56.0,5.76,5.73,3.53
2757,0.72,Ideal,D,SI1,60.8,57.0,5.75,5.76,3.50


In [140]:
data[data['cut'] == 'Ideal']

Unnamed: 0_level_0,carat,cut,color,clarity,depth,table,x,y,z
price,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
326,0.23,Ideal,E,SI2,61.5,55.0,3.95,3.98,2.43
340,0.23,Ideal,J,VS1,62.8,56.0,3.93,3.90,2.46
344,0.31,Ideal,J,SI2,62.2,54.0,4.35,4.37,2.71
348,0.30,Ideal,I,SI2,62.0,54.0,4.31,4.34,2.68
403,0.33,Ideal,I,SI2,61.8,55.0,4.49,4.51,2.78
...,...,...,...,...,...,...,...,...,...
2756,0.79,Ideal,I,SI1,61.6,56.0,5.95,5.97,3.67
2756,0.71,Ideal,E,SI1,61.9,56.0,5.71,5.73,3.54
2756,0.71,Ideal,G,VS1,61.4,56.0,5.76,5.73,3.53
2757,0.72,Ideal,D,SI1,60.8,57.0,5.75,5.76,3.50


In [141]:
#select all rows where cut equals Ideal
#and show only the columns carat, cut, and color
data.loc[data['cut'] == 'Ideal',['carat', 'cut', 'color']]

Unnamed: 0_level_0,carat,cut,color
price,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
326,0.23,Ideal,E
340,0.23,Ideal,J
344,0.31,Ideal,J
348,0.30,Ideal,I
403,0.33,Ideal,I
...,...,...,...
2756,0.79,Ideal,I
2756,0.71,Ideal,E
2756,0.71,Ideal,G
2757,0.72,Ideal,D


In [142]:
#select all rows where cut equals Ideal
#and carat is greater than 0.5,
#and show only the columns carat, cut, and color
data.loc[(data['cut'] == 'Ideal') & (data['carat'] > 0.5),['carat', 'cut', 'color']]

Unnamed: 0_level_0,carat,cut,color
price,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2757,0.70,Ideal,E
2757,0.70,Ideal,G
2760,0.74,Ideal,G
2760,0.80,Ideal,I
2760,0.75,Ideal,G
...,...,...,...
2756,0.79,Ideal,I
2756,0.71,Ideal,E
2756,0.71,Ideal,G
2757,0.72,Ideal,D
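
Conditions like the ones above are boolean Series, so they can be stored in variables and combined. The sketch below uses made-up rows; the names `is_ideal` and `is_large` are assumptions for illustration.

```python
import pandas as pd

# Made-up rows, only for illustration.
df = pd.DataFrame({
    "carat": [0.23, 0.70, 0.74, 0.31],
    "cut": ["Ideal", "Ideal", "Good", "Premium"],
})

is_ideal = df["cut"] == "Ideal"
is_large = df["carat"] > 0.5

# & means and, | means or, ~ means not. When conditions are written inline,
# each one needs its own parentheses because & and | bind tighter than == and >.
print(df.loc[is_ideal & is_large, "carat"].tolist())    # [0.7]
print(df.loc[is_ideal | is_large, "carat"].tolist())    # [0.23, 0.7, 0.74]
print(df.loc[~is_ideal, "cut"].tolist())                # ['Good', 'Premium']

# isin is a compact way to test membership in several values.
print(df.loc[df["cut"].isin(["Good", "Premium"]), "carat"].tolist())   # [0.74, 0.31]
```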


# Summary

In this chapter, we used the pandas library to read data, and we used Series and DataFrame objects to summarize and manipulate it.