<center><h1>Chapter 9 Categorical Data</h1></center>

In [2]:
import numpy as np
import pandas as pd

## 1. cat object
### 1. Attributes of cat object
`pandas` provides the `category` type for working with categorical variables. To convert an ordinary sequence into a categorical one, use the `astype` method.

In [3]:
df = pd.read_csv('../data/learn_pandas.csv', usecols = ['Grade', 'Name', 'Gender', 'Height', 'Weight'])
s = df.Grade.astype('category')
s.head()

0     Freshman
1     Freshman
2       Senior
3    Sophomore
4    Sophomore
Name: Grade, dtype: category
Categories (4, object): ['Freshman', 'Junior', 'Senior', 'Sophomore']

The `cat` object is defined on a categorical `Series`. Like the `str` object introduced in the previous chapter, it provides attributes and methods for categorical operations.

In [4]:
s.cat

<pandas.core.arrays.categorical.CategoricalAccessor object at 0x0000020F9B7A7108>

A categorical type has two components: the categories themselves, stored as an `Index`, and whether they are ordered. Both can be accessed through the `cat` accessor:

In [5]:
s.cat.categories

Index(['Freshman', 'Junior', 'Senior', 'Sophomore'], dtype='object')

In [6]:
s.cat.ordered

False

In addition, each category is assigned a unique integer code, determined by its position in `cat.categories`; the codes can be accessed via `codes`:

In [7]:
s.cat.codes.head()

0    0
1    0
2    2
3    3
4    3
dtype: int8
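
As a minimal sketch of how codes relate to categories, the snippet below uses a small made-up series (the name `grades` and its values are illustrative stand-ins, not the course dataset) and checks that looking each code up in `cat.categories` recovers the original values.

```python
import pandas as pd

# Hypothetical toy data standing in for the Grade column
grades = pd.Series(['Freshman', 'Senior', 'Sophomore', 'Freshman']).astype('category')

categories = list(grades.cat.categories)  # sorted lexically by default
codes = grades.cat.codes.tolist()         # position of each value in categories

# Looking every code up in the categories recovers the original values
decoded = [categories[c] for c in codes]
print(categories)
print(codes)
print(decoded)
```

This is why a categorical column can be much more compact than a column of repeated strings: only the small integer codes are stored per row.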

### 2. Add, delete and modify categories

The `categories` attribute of the `cat` object handles querying, so how are the other three of the four operations (add, delete, modify, query) performed?

#### 【NOTE】Categories cannot be modified directly
As mentioned in Chapter 3, an ``Index`` cannot be modified with ``index_obj[0] = item``, and ``categories`` is stored as an ``Index``, so ``pandas`` defines several methods on the ``cat`` accessor to achieve the same purpose.
#### 【END】
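
The restriction in the note can be observed directly. The following sketch, on a hypothetical three-element series `s_demo`, tries to assign into `cat.categories` and catches the resulting `TypeError`.

```python
import pandas as pd

# Hypothetical toy series used only for this illustration
s_demo = pd.Series(['a', 'b', 'a'], dtype='category')

try:
    s_demo.cat.categories[0] = 'z'   # Index objects do not support item assignment
    raised = False
except TypeError as err:
    raised = True
    print('TypeError:', err)

# The categories are left untouched
print(list(s_demo.cat.categories))
```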

First, you can use `add_categories` to add categories:

In [8]:
s = s.cat.add_categories('Graduate') # add a Graduate category
s.cat.categories

Index(['Freshman', 'Junior', 'Senior', 'Sophomore', 'Graduate'], dtype='object')

To delete a category, use `remove_categories`; all values in the sequence that belonged to the removed category are set to missing. For example, delete the `Freshman` category:

In [9]:
s = s.cat.remove_categories('Freshman')
s.cat.categories

Index(['Junior', 'Senior', 'Sophomore', 'Graduate'], dtype='object')

In [10]:
s.head()

0          NaN
1          NaN
2       Senior
3    Sophomore
4    Sophomore
Name: Grade, dtype: category
Categories (4, object): ['Junior', 'Senior', 'Sophomore', 'Graduate']

In addition, you can use `set_categories` to directly set the new categories of the sequence. If there are elements in the original categories that do not belong to the new categories, they will be set to missing.

In [11]:
s = s.cat.set_categories(['Sophomore','PhD']) # the new categories are Sophomore and PhD
s.cat.categories

Index(['Sophomore', 'PhD'], dtype='object')

In [12]:
s.head()

0          NaN
1          NaN
2          NaN
3    Sophomore
4    Sophomore
Name: Grade, dtype: category
Categories (2, object): ['Sophomore', 'PhD']

If you want to remove categories that do not appear in the sequence, you can use `remove_unused_categories` to achieve this:

In [13]:
s = s.cat.remove_unused_categories() # removes the PhD category, which never appears
s.cat.categories

Index(['Sophomore'], dtype='object')

Finally, the remaining operation among "add, delete, modify and query" is modification, which is done with the `rename_categories` method. Note that this method also changes the corresponding values in the sequence. For example, rename `Sophomore` to the Chinese label `本科二年级学生` (second-year undergraduate student):

In [14]:
s = s.cat.rename_categories({'Sophomore':'本科二年级学生'})
s.head()

0        NaN
1        NaN
2        NaN
3    本科二年级学生
4    本科二年级学生
Name: Grade, dtype: category
Categories (1, object): ['本科二年级学生']
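
The five methods from this section can be chained on a small example. This is a sketch on a made-up series (`s_demo` and its letters are assumptions for illustration); each step returns a new series rather than modifying the old one in place.

```python
import pandas as pd

# Hypothetical toy series used only for this illustration
s_demo = pd.Series(['a', 'b', 'c', 'a'], dtype='category')

added = s_demo.cat.add_categories('d')             # categories: a, b, c, d
removed = added.cat.remove_categories('a')         # values 'a' become NaN
reset = removed.cat.set_categories(['b', 'x'])     # 'c' is no longer a category -> NaN
pruned = reset.cat.remove_unused_categories()      # drops 'x', which never occurs
renamed = pruned.cat.rename_categories({'b': 'B'}) # renames both category and values

print(renamed.tolist())
print(list(renamed.cat.categories))
```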

## 2. Ordered classification
### 1. Establishing order

Unordered categories can be converted to ordered ones with `reorder_categories`, and ordered ones back to unordered with `as_unordered`. Note that the list passed to `reorder_categories` must be a permutation of the current categories: no new category may be added and none may be left out. `ordered=True` must also be specified; otherwise the categories are rearranged but the sequence remains unordered. For example, impose an order on the grades and then restore the unordered state:

In [15]:
s = df.Grade.astype('category')
s = s.cat.reorder_categories(['Freshman', 'Sophomore', 'Junior', 'Senior'],ordered=True)
s.head()

0     Freshman
1     Freshman
2       Senior
3    Sophomore
4    Sophomore
Name: Grade, dtype: category
Categories (4, object): ['Freshman' < 'Sophomore' < 'Junior' < 'Senior']

In [16]:
s.cat.as_unordered().head()

0     Freshman
1     Freshman
2       Senior
3    Sophomore
4    Sophomore
Name: Grade, dtype: category
Categories (4, object): ['Freshman', 'Sophomore', 'Junior', 'Senior']
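
The constraints on `reorder_categories` can be checked with a short sketch. The series `s_demo` and its `low/mid/high` labels are made up for illustration: omitting `ordered=True` rearranges the categories without making them ordered, and a list that is not a permutation of the current categories is rejected with a `ValueError`.

```python
import pandas as pd

# Hypothetical toy series used only for this illustration
s_demo = pd.Series(['low', 'high', 'mid'], dtype='category')

# Without ordered=True the categories are rearranged but stay unordered
rearranged = s_demo.cat.reorder_categories(['low', 'mid', 'high'])
ordered = s_demo.cat.reorder_categories(['low', 'mid', 'high'], ordered=True)

# Leaving out (or adding) a category is rejected
try:
    s_demo.cat.reorder_categories(['low', 'mid'])
    rejected = False
except ValueError:
    rejected = True

print(rearranged.cat.ordered, ordered.cat.ordered, rejected)
print(ordered.min(), ordered.max())   # min/max follow the category order
```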

#### 【NOTE】An alternative to specifying `ordered=True`
If you do not want to pass `ordered=True`, you can first call `s.cat.as_ordered()` to convert to ordered categories, and then use `reorder_categories` to adjust the relative order.
#### 【END】
#### 【END】
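
A minimal sketch of the two-step route described in the note, again on a made-up `low/mid/high` series: `as_ordered` turns on ordering using the existing (lexical) category order, and a following `reorder_categories` keeps the ordered flag while fixing the order.

```python
import pandas as pd

# Hypothetical toy series used only for this illustration
s_demo = pd.Series(['low', 'high', 'mid'], dtype='category')

step1 = s_demo.cat.as_ordered()                               # ordered as high < low < mid
step2 = step1.cat.reorder_categories(['low', 'mid', 'high'])  # ordered flag is preserved

print(list(step1.cat.categories), step1.cat.ordered)
print(list(step2.cat.categories), step2.cat.ordered)
```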
### 2. Sorting and comparison

Chapter 2 covered sorting of string and numeric sequences. Sorting a categorical variable works the same way: convert the column to `category`, assign the desired order to the categories, and then use `sort_index` and `sort_values` as usual. For example, to sort by grade:

In [17]:
df.Grade = df.Grade.astype('category')
df.Grade = df.Grade.cat.reorder_categories(['Freshman', 'Sophomore', 'Junior', 'Senior'],ordered=True)
df.sort_values('Grade').head() # sort by value

| | Grade | Name | Gender | Height | Weight |
|---|---|---|---|---|---|
| 0 | Freshman | Gaopeng Yang | Female | 158.9 | 46.0 |
| 105 | Freshman | Qiang Shi | Female | 164.5 | 52.0 |
| 96 | Freshman | Changmei Feng | Female | 163.8 | 56.0 |
| 88 | Freshman | Xiaopeng Han | Female | 164.1 | 53.0 |
| 81 | Freshman | Yanli Zhang | Female | 165.1 | 52.0 |


In [18]:
df.set_index('Grade').sort_index().head() # sort by index

| Grade | Name | Gender | Height | Weight |
|---|---|---|---|---|
| Freshman | Gaopeng Yang | Female | 158.9 | 46.0 |
| Freshman | Qiang Shi | Female | 164.5 | 52.0 |
| Freshman | Changmei Feng | Female | 163.8 | 56.0 |
| Freshman | Xiaopeng Han | Female | 164.1 | 53.0 |
| Freshman | Yanli Zhang | Female | 165.1 | 52.0 |
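
To make the effect of the category order visible, the sketch below sorts the same four made-up labels twice: once as plain strings (alphabetical) and once as an ordered categorical. The variable names are illustrative only.

```python
import pandas as pd

# Hypothetical toy data: one label per grade, deliberately shuffled
raw = pd.Series(['Senior', 'Freshman', 'Junior', 'Sophomore'])

# Plain strings sort alphabetically
as_text = raw.sort_values().tolist()

# An ordered categorical sorts by the declared category order
cat = raw.astype('category').cat.reorder_categories(
    ['Freshman', 'Sophomore', 'Junior', 'Senior'], ordered=True)
as_cat = cat.sort_values().tolist()

print(as_text)
print(as_cat)
```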


Once an order is established, comparisons can be performed. Comparisons on categorical variables fall into two kinds. The first is `==` or `!=`, where the other operand may be a scalar or a `Series` (or `list`) of the same length. The second is the four ordering comparisons `>, >=, <, <=`. The operands are similar to the first kind, but every element being compared must belong to the `categories` of the original sequence, and a `Series` operand must have the same index as the original sequence.

In [19]:
res1 = df.Grade == 'Sophomore'
res1.head()

0    False
1    False
2    False
3     True
4     True
Name: Grade, dtype: bool

In [20]:
res2 = df.Grade == ['PhD']*df.shape[0]
res2.head()

0    False
1    False
2    False
3    False
4    False
Name: Grade, dtype: bool

In [21]:
res3 = df.Grade <= 'Sophomore'
res3.head()

0     True
1     True
2    False
3     True
4     True
Name: Grade, dtype: bool

In [22]:
res4 = df.Grade <= df.Grade.sample(frac=1).reset_index(drop=True) # compare with a shuffled copy
res4.head()

0     True
1     True
2    False
3     True
4     True
Name: Grade, dtype: bool
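
The difference between the two kinds of comparison shows up when the scalar is not one of the categories. In this sketch (toy data, with `pd.CategoricalDtype` used to declare the order in one step), `==` against an unknown label simply yields `False` everywhere, while `<=` raises a `TypeError`.

```python
import pandas as pd

# Hypothetical toy series with an explicit ordered dtype
dtype = pd.CategoricalDtype(['Freshman', 'Sophomore', 'Junior', 'Senior'], ordered=True)
grade = pd.Series(['Freshman', 'Senior', 'Sophomore']).astype(dtype)

eq_unknown = (grade == 'PhD').tolist()       # equality tolerates an unknown label
le_known = (grade <= 'Sophomore').tolist()   # ordering works for a known category

try:
    grade <= 'PhD'                           # ordering against an unknown label fails
    raised = False
except TypeError:
    raised = True

print(eq_unknown, le_known, raised)
```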

## 3. Interval category

### 1. Use cut and qcut to construct intervals

Intervals are a special kind of category. In practice, interval sequences are often constructed with the `cut` and `qcut` functions, which bin a numeric sequence, that is, replace each original value with the interval it falls into.

First, the common usage of `cut`.

Its most important parameter is `bins`. If an integer `n` is passed, the range between the minimum and maximum of the input is divided into `n` equal-width bins. Because bins are open on the left and closed on the right by default, the minimum would fall outside the first bin, so `pandas` extends the left edge of the first bin by `0.001*(max-min)`. For example, cutting the sequence `[1,2]` into 2 bins yields `(0.999,1.5]` and `(1.5,2]`. To get bins closed on the left and open on the right, set the `right` parameter to `False`; in that case the right edge of the last bin is extended by `0.001*(max-min)` instead.

In [23]:
s = pd.Series([1,2])
pd.cut(s, bins=2)

0    (0.999, 1.5]
1      (1.5, 2.0]
dtype: category
Categories (2, interval[float64]): [(0.999, 1.5] < (1.5, 2.0]]

In [24]:
pd.cut(s, bins=2, right=False)

0      [1.0, 1.5)
1    [1.5, 2.001)
dtype: category
Categories (2, interval[float64]): [[1.0, 1.5) < [1.5, 2.001)]
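
The `0.001*(max-min)` adjustment can be verified numerically. The sketch below uses a made-up two-value series spanning 0 to 10 and `retbins=True` to read back the edges that `cut` actually used; only the outermost edge on the open side is shifted.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series: min 0, max 10, so 0.001 * (max - min) = 0.01
values = pd.Series([0, 10])

_, right_edges = pd.cut(values, bins=4, retbins=True)               # (a, b] bins
_, left_edges = pd.cut(values, bins=4, right=False, retbins=True)   # [a, b) bins

print(right_edges)   # first edge lowered so that the minimum is included
print(left_edges)    # last edge raised so that the maximum is included
```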

Another common use of `bins` is to specify a list of bin split points (using `np.inf` to represent infinity; the older alias `np.infty` was removed in NumPy 2.0):

In [25]:
pd.cut(s, bins=[-np.inf, 1.2, 1.8, 2.2, np.inf])

0    (-inf, 1.2]
1     (1.8, 2.2]
dtype: category
Categories (4, interval[float64]): [(-inf, 1.2] < (1.2, 1.8] < (1.8, 2.2] < (2.2, inf]]

Two other commonly used parameters are `labels` and `retbins`, which specify the names of the bins and whether to return the split points (not returned by default):

In [26]:
s = pd.Series([1,2])
res = pd.cut(s, bins=2, labels=['small', 'big'], retbins=True)
res[0]

0    small
1      big
dtype: category
Categories (2, object): ['small' < 'big']

In [27]:
res[1] # this element holds the returned split points

array([0.999, 1.5  , 2.   ])
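
Explicit split points, `labels` and `retbins` can be combined. The following sketch bins some made-up ages into named groups; the edges and label names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical ages and cut points
ages = pd.Series([5, 17, 30, 70])
edges = [-np.inf, 12, 18, 65, np.inf]

groups, used_edges = pd.cut(
    ages, bins=edges,
    labels=['child', 'teen', 'adult', 'senior'],  # one label per bin
    retbins=True)                                 # also return the split points

print(groups.tolist())
print(used_edges)
```

When explicit edges are passed, they are returned unchanged, so no `0.001*(max-min)` adjustment takes place.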

`qcut` is used almost exactly like `cut`, except that the `bins` parameter is replaced by `q`. The `q` in `qcut` stands for quantile. When `q` is an integer `n`, the data is split into `n` bins at equally spaced quantiles; a list of floats can also be passed to give the quantile split points directly.

In [28]:
s = df.Weight
pd.qcut(s, q=3).head()

0    (33.999, 48.0]
1      (55.0, 89.0]
2      (55.0, 89.0]
3    (33.999, 48.0]
4      (55.0, 89.0]
Name: Weight, dtype: category
Categories (3, interval[float64]): [(33.999, 48.0] < (48.0, 55.0] < (55.0, 89.0]]

In [29]:
pd.qcut(s, q=[0,0.2,0.8,1]).head()

0      (44.0, 69.4]
1      (69.4, 89.0]
2      (69.4, 89.0]
3    (33.999, 44.0]
4      (69.4, 89.0]
Name: Weight, dtype: category
Categories (3, interval[float64]): [(33.999, 44.0] < (44.0, 69.4] < (69.4, 89.0]]
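
The practical difference between the two functions is easiest to see on skewed data. In this sketch (made-up values with one outlier), `cut` produces equal-width bins with very uneven counts, while `qcut` produces bins with equal counts.

```python
import pandas as pd

# Hypothetical skewed data: eight small values and one outlier
skewed = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

# Equal-width bins: almost everything lands in the first bin
width_counts = pd.cut(skewed, bins=3).value_counts().sort_index().tolist()

# Quantile bins: each bin receives the same number of observations
quantile_counts = pd.qcut(skewed, q=3).value_counts().sort_index().tolist()

print(width_counts)
print(quantile_counts)
```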

### 2. Construction of general intervals

A specific interval has three elements: the left endpoint, the right endpoint, and which endpoints are closed. The closed side can be specified as one of `right, left, both, neither`:

In [30]:
my_interval = pd.Interval(0, 1, 'right')
my_interval

Interval(0, 1, closed='right')

Its attributes include `mid, length, right, left, closed`, which give the midpoint, length, right endpoint, left endpoint and closed side respectively.

Use `in` to determine whether an element belongs to an interval:

In [31]:
0.5 in my_interval

True

Use `overlaps` to determine whether two intervals intersect:

In [32]:
my_interval_2 = pd.Interval(0.5, 1.5, 'left')
my_interval.overlaps(my_interval_2)

True
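
The closed side matters at the endpoints, both for membership and for overlap. A small sketch with illustrative intervals: two intervals that touch at a single point overlap only if that point is closed in both.

```python
import pandas as pd

iv = pd.Interval(0, 1, closed='right')               # (0, 1]
print(iv.left, iv.right, iv.mid, iv.length, iv.closed)

# Membership respects the open/closed endpoints
left_in, right_in = (0 in iv), (1 in iv)

# Both intervals below touch iv only at the point 1
touching_closed = pd.Interval(1, 2, closed='left')   # [1, 2) contains 1
touching_open = pd.Interval(1, 2, closed='right')    # (1, 2] does not contain 1

print(left_in, right_in)
print(iv.overlaps(touching_closed), iv.overlaps(touching_open))
```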

Generally, a `pd.IntervalIndex` object is created in one of four ways, `from_breaks, from_arrays, from_tuples, interval_range`, each suited to a different situation:

`from_breaks` is similar to `cut` or `qcut`, except that those two compute the split points, whereas `from_breaks` takes custom split points directly:

In [33]:
pd.IntervalIndex.from_breaks([1,3,6,10], closed='both')

IntervalIndex([[1, 3], [3, 6], [6, 10]],
              closed='both',
              dtype='interval[int64]')

`from_arrays` takes the lists of left and right endpoints separately, which is useful when intervals overlap and their start and end points are known:

In [34]:
pd.IntervalIndex.from_arrays(left = [1,3,6,10], right = [5,4,9,11], closed = 'neither')

IntervalIndex([(1, 5), (3, 4), (6, 9), (10, 11)],
              closed='neither',
              dtype='interval[int64]')

`from_tuples` takes a list of (start, end) tuples:

In [35]:
pd.IntervalIndex.from_tuples([(1,5),(3,4),(6,9),(10,11)], closed='neither')

IntervalIndex([(1, 5), (3, 4), (6, 9), (10, 11)],
              closed='neither',
              dtype='interval[int64]')
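
For contiguous intervals the three constructors are interchangeable. The sketch below builds the same two intervals in each way, using the default `closed='right'`, and compares the results with `equals`.

```python
import pandas as pd

# The same two intervals, (1, 3] and (3, 6], built three ways
from_breaks = pd.IntervalIndex.from_breaks([1, 3, 6])
from_arrays = pd.IntervalIndex.from_arrays(left=[1, 3], right=[3, 6])
from_tuples = pd.IntervalIndex.from_tuples([(1, 3), (3, 6)])

same = from_breaks.equals(from_arrays) and from_breaks.equals(from_tuples)
print(from_breaks)
print(same)
```

Only `from_arrays` and `from_tuples` can express overlapping or non-adjacent intervals, since `from_breaks` always shares each interior split point between two neighbours.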

An equally spaced sequence of intervals is determined by its start point, end point, number of intervals and interval length; once three of these are fixed, the fourth follows. The `start, end, periods, freq` parameters of `interval_range` correspond to these four quantities, so the corresponding intervals can be constructed from any three:

In [36]:
pd.interval_range(start=1,end=5,periods=8)

IntervalIndex([(1.0, 1.5], (1.5, 2.0], (2.0, 2.5], (2.5, 3.0], (3.0, 3.5], (3.5, 4.0], (4.0, 4.5], (4.5, 5.0]],
              closed='right',
              dtype='interval[float64]')

In [37]:
pd.interval_range(end=5,periods=8,freq=0.5)

IntervalIndex([(1.0, 1.5], (1.5, 2.0], (2.0, 2.5], (2.5, 3.0], (3.0, 3.5], (3.5, 4.0], (4.0, 4.5], (4.5, 5.0]],
              closed='right',
              dtype='interval[float64]')
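
As a quick check that any three of the four parameters describe the same index, the sketch below builds the interval sequence from each possible triple and compares the endpoints numerically.

```python
import numpy as np
import pandas as pd

# Each call omits a different one of the four parameters
no_freq = pd.interval_range(start=1, end=5, periods=8)
no_start = pd.interval_range(end=5, periods=8, freq=0.5)
no_end = pd.interval_range(start=1, periods=8, freq=0.5)
no_periods = pd.interval_range(start=1, end=5, freq=0.5)

candidates = [no_start, no_end, no_periods]
all_equal = all(
    np.allclose(no_freq.left, idx.left) and np.allclose(no_freq.right, idx.right)
    for idx in candidates)
print(all_equal)
```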

#### 【Practice】
Both `interval_range` and `date_range` (covered in the next chapter on time series) determine the whole sequence from three of the four elements of an arithmetic sequence. Please review the relationship between the first term, last term, number of terms and common difference of an arithmetic sequence, and write down the identity relating the four parameters of `interval_range`.
#### 【END】
In addition, if you call `pd.IntervalIndex([...], closed=...)` directly with a list of `Interval` objects, every interval is forcibly converted to the specified `closed` type, because a `pd.IntervalIndex` can only store `Interval` objects that are closed on the same side.

In [38]:
my_interval

Interval(0, 1, closed='right')

In [39]:
my_interval_2

Interval(0.5, 1.5, closed='left')

In [40]:
pd.IntervalIndex([my_interval, my_interval_2], closed='left')

IntervalIndex([[0.0, 1.0), [0.5, 1.5)],
              closed='left',
              dtype='interval[float64]')

### 3. Interval properties and methods

`IntervalIndex` also defines some useful properties and methods. To use them on the result of `cut` or `qcut`, first convert that result to this index type:

In [41]:
id_interval = pd.IntervalIndex(pd.cut(s, 3))
id_interval[:3]

IntervalIndex([(33.945, 52.333], (52.333, 70.667], (70.667, 89.0]],
              closed='right',
              name='Weight',
              dtype='interval[float64]')

Like a single `Interval`, `IntervalIndex` has several common properties, `left, right, mid, length`, which give the left endpoints, right endpoints, midpoints and lengths of the intervals respectively.

In [42]:
id_demo = id_interval[:5] # select the first 5 for display
id_demo

IntervalIndex([(33.945, 52.333], (52.333, 70.667], (70.667, 89.0], (33.945, 52.333], (70.667, 89.0]],
              closed='right',
              name='Weight',
              dtype='interval[float64]')

In [43]:
id_demo.left

Float64Index([33.945, 52.333, 70.667, 33.945, 70.667], dtype='float64')

In [44]:
id_demo.right

Float64Index([52.333, 70.667, 89.0, 52.333, 89.0], dtype='float64')

In [45]:
id_demo.mid

Float64Index([43.138999999999996, 61.5, 79.8335, 43.138999999999996, 79.8335], dtype='float64')

In [46]:
id_demo.length

Float64Index([18.387999999999998, 18.334000000000003, 18.333,
              18.387999999999998, 18.333],
             dtype='float64')

`IntervalIndex` also has two commonly used methods, `contains` and `overlaps`: the former checks, interval by interval, whether each one contains a given value, and the latter checks whether each one overlaps a given `pd.Interval` object.

In [47]:
id_demo.contains(4)

array([False, False, False, False, False])

In [48]:
id_demo.overlaps(pd.Interval(40,60))

array([ True,  True, False,  True, False])
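
The properties and methods above are easier to read on round numbers. This sketch uses a small index built with `interval_range` (an illustrative stand-in for the `Weight` bins) and a probe interval that relies on the default `closed='right'`.

```python
import pandas as pd

idx = pd.interval_range(start=0, end=3)          # (0, 1], (1, 2], (2, 3]

print(idx.left.tolist(), idx.right.tolist())     # endpoints of every interval
print(idx.mid.tolist(), idx.length.tolist())     # midpoints and lengths

contains_mask = idx.contains(1.5).tolist()                     # which interval holds 1.5
overlap_mask = idx.overlaps(pd.Interval(0.5, 1.5)).tolist()    # which intervals meet (0.5, 1.5]
print(contains_mask, overlap_mask)
```

Both methods return boolean arrays aligned with the index, so they can be used directly as masks.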

## 4. Exercises
### Ex1: Counting categories that do not appear

Chapter 5 introduced the `crosstab` function, which with default parameters counts the frequency of each combination of values in two columns:

In [49]:
df = pd.DataFrame({'A':['a','b','c','a'], 'B':['cat','cat','dog','cat']})
pd.crosstab(df.A, df.B)

| A \ B | cat | dog |
|---|---|---|
| a | 2 | 0 |
| b | 1 | 0 |
| c | 0 | 1 |


In practice, however, a column may store a categorical variable whose values do not cover all of its categories. To include these unobserved categories in the `crosstab` result, set the `dropna` parameter to `False`:

In [50]:
df.B = df.B.astype('category').cat.add_categories('sheep')
pd.crosstab(df.A, df.B, dropna=False)

| A \ B | cat | dog | sheep |
|---|---|---|---|
| a | 2 | 0 | 0 |
| b | 1 | 0 | 0 |
| c | 0 | 1 | 0 |


Please implement a `my_crosstab` function with a `dropna` parameter that reproduces the behavior above.

### Ex2: Diamond Dataset

There is a dataset about diamonds, where `carat, cut, clarity, price` represent carat weight, cut quality, clarity and price respectively. The sample is as follows:

In [51]:
df = pd.read_csv('../data/diamonds.csv') 
df.head(3)

| | carat | cut | clarity | price |
|---|---|---|---|---|
| 0 | 0.23 | Ideal | SI2 | 326 |
| 1 | 0.21 | Premium | SI1 | 326 |
| 2 | 0.23 | Good | VS1 | 327 |


1. Apply the `nunique` function to `df.cut` stored as `object` type and as `category` type, and compare the performance.
2. Diamond cut quality has five levels, from low to high `Fair, Good, Very Good, Premium, Ideal`, and clarity has eight levels, from low to high `I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF`. Please sort by cut quality from **high to low**; diamonds with the same cut quality should be sorted by clarity from **low to high**.
3. Use two different methods to map the two columns `cut, clarity` to integers from 0 to n-1 in order from **high to low**, where n is the number of categories.
4. Bin the price per carat in two ways, by the quantiles (q=\[0.2, 0.4, 0.6, 0.8\]) and by the cut points \[1000, 3500, 5500, 18000\], to obtain five categories `Very Low, Low, Mid, High, Very High`, and add the two resulting `category` sequences to the original table in that order.
5. In question 4, do all categories appear in the sequence obtained by binning at the integer cut points? If not, please delete the categories that do not appear.
6. For the sequence obtained by quantile binning in question 4, find the left endpoint, right endpoint and length of the interval corresponding to each sample.