*UCIML datasets from their API, Kernel machines, Barplot with percentages annotations, pie charts, Imbalance ratio*

![](designer.jpeg)

# Daily Note - 03/06/2024

## 1. Downloading UCIML datasets from their API

```python
!pip3 install -U ucimlrepo 
```

The documentation for the API is available [here](https://github.com/uci-ml-repo/ucimlrepo).
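
As a minimal sketch of how the package is typically used once installed, the snippet below wraps `fetch_ucirepo` (the entry point shown in the package README) in a small helper. The helper name `load_uci_dataset` is hypothetical, and the id `53` in the usage comment is the Iris example from the README, not necessarily the dataset used in this note. Fetching a dataset requires network access.

```python
def load_uci_dataset(dataset_id):
    """Fetch a UCI dataset by id and return (features, targets, metadata).

    `load_uci_dataset` is a hypothetical helper name used for illustration.
    """
    # Imported inside the function so this file can be loaded even
    # before `ucimlrepo` is installed.
    from ucimlrepo import fetch_ucirepo

    dataset = fetch_ucirepo(id=dataset_id)
    # `data.features` and `data.targets` are pandas DataFrames
    return dataset.data.features, dataset.data.targets, dataset.metadata


# Example usage (needs network access; 53 is the Iris id from the README):
# X, y, metadata = load_uci_dataset(53)
# print(metadata)
```

Returning the features and targets separately keeps the later plotting code short, since the target column is what the distribution plots work with.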


## 2. Barplot with percentage annotations

A barplot with percentage annotations can be very useful to show the distribution of a categorical variable.

An easy way to do this is with pandas and matplotlib.

The first step is to create the `fig` and `ax` objects with matplotlib and specify the number of subplots we are going to use.

```python
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 2, figsize=(14, 6))
```

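Then we call the `value_counts` method from pandas with `normalize=True`, which returns the proportion of each category instead of the raw counts, and plot the result as horizontal bars on the first axis. `yo` is assumed here to be the pandas Series holding the target of the original dataset.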

```python
yo.value_counts(normalize=True).plot(kind='barh', ax=ax[0])
```
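
To make the effect of `normalize=True` concrete, here is a small sketch on a made-up Series. The labels are invented for illustration and are not taken from the dataset used in this note.

```python
import pandas as pd

# Made-up labels, purely for illustration
labels = pd.Series(["yes", "yes", "yes", "no"])

counts = labels.value_counts()                     # raw counts per category
proportions = labels.value_counts(normalize=True)  # fractions that sum to 1

print(counts)       # 'yes' appears 3 times, 'no' once
print(proportions)  # 'yes' -> 0.75, 'no' -> 0.25
```

Because the plotted values are proportions, the width of each bar can later be turned into a percentage by multiplying by 100.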

Because we are drawing two plots, we repeat the call for the second axis, this time with the `Target` column of `df`:

```python
df.Target.value_counts(normalize=True).plot(kind='barh', ax=ax[1])
```

We can use the `set_title` method from matplotlib to add a title to each plot:

```python
ax[0].set_title('Original Dataset')
ax[1].set_title('Train Dataset')
```

Here is the most obscure part of the code. First we iterate over the axes with `for axis in ax`. `ax` is the array of axes objects created by `plt.subplots`, so it contains the two axes we created, `ax[0]` and `ax[1]`. Then we iterate over the patches of each barplot: every bar is represented as a patch in matplotlib, and `axis.patches` contains all the bars in the current axis.

```python
for axis in ax:
    for p in axis.patches:
```
To calculate the percentage, we retrieve the width of the bar with `p.get_width()`. Because the counts were normalized, the width is the proportion of that category. Multiplying it by 100 and formatting it with two decimals and a '%' symbol gives the label: `perc = f'{p.get_width() * 100:.2f}%'`

To add the percentage to the bar, we use the `axis.annotate` method, which adds an annotation to the plot. The tuple `(p.get_width(), p.get_y() + p.get_height() / 2)` sets the anchor point of the annotation: `p.get_width()` is the x-coordinate, at the end of the bar, and `p.get_y() + p.get_height() / 2` is the y-coordinate, the vertical center of the bar. `ha='left'` sets the horizontal alignment to the left and `va='center'` sets the vertical alignment to the center. `xytext=(-50, 0)` specifies an offset for the text, moving it 50 points to the left of the end of the bar, so the label starts inside the bar when the bar is long enough. `textcoords='offset points'` tells `annotate` to interpret the `xytext` offset as being measured in points from the (x, y) position.

```python
        axis.annotate(perc, (p.get_width(), p.get_y() + p.get_height() / 2),
                       ha='left', va='center', xytext=(-50, 0), textcoords='offset points')
```

The final code is:

```python
import matplotlib.pyplot as plt

# `yo` (assumed: original target Series) and `df` (train DataFrame) must already be defined
fig, ax = plt.subplots(1, 2, figsize=(14, 6))

# Normalized counts (proportions) as horizontal bars, one plot per axis
yo.value_counts(normalize=True).plot(kind='barh', ax=ax[0], color='skyblue')
ax[0].set_title('Original Dataset')
df.Target.value_counts(normalize=True).plot(kind='barh', ax=ax[1], color='skyblue')
ax[1].set_title('Train Dataset')

# Annotate every bar with its width expressed as a percentage
for axis in ax:
    for p in axis.patches:
        perc = f'{p.get_width() * 100:.2f}%'
        axis.annotate(perc, (p.get_width(), p.get_y() + p.get_height() / 2),
                      ha='left', va='center', xytext=(-50, 0),
                      textcoords='offset points')

plt.tight_layout()
plt.show()
```

![](target_distribution.png)
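
If the same annotation is needed in several plots, the loop can be pulled into a small function. The sketch below is one possible way to do it, not the code used to produce the figure above: the name `annotate_barh_percentages`, its `offset_points` parameter, and the proportions are made up for illustration, and the `Agg` backend line is only there so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # optional: non-interactive backend, no display needed
import matplotlib.pyplot as plt


def annotate_barh_percentages(axis, offset_points=(-50, 0)):
    """Label every horizontal bar on `axis` with its width as a percentage.

    Assumes the bar widths are proportions between 0 and 1.
    Returns the list of label strings that were drawn.
    """
    labels = []
    for p in axis.patches:
        perc = f"{p.get_width() * 100:.2f}%"
        axis.annotate(perc, (p.get_width(), p.get_y() + p.get_height() / 2),
                      ha="left", va="center",
                      xytext=offset_points, textcoords="offset points")
        labels.append(perc)
    return labels


# Made-up proportions standing in for value_counts(normalize=True)
proportions = {"A": 0.5, "B": 0.3, "C": 0.2}
fig, ax = plt.subplots(figsize=(7, 4))
ax.barh(list(proportions.keys()), list(proportions.values()), color="skyblue")
labels = annotate_barh_percentages(ax)
print(labels)  # ['50.00%', '30.00%', '20.00%']
```

For a grid of subplots, the helper can be called once per axis, for example `for axis in ax: annotate_barh_percentages(axis)`.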
 

## 3. Don't use pie charts

The main issue is that humans are generally not good at accurately estimating areas and angles. In a 2D pie chart, the size of each slice is determined by the angle and the area it occupies, which can be hard to compare precisely. A bar chart represents data with the length of the bars along a common baseline, which makes it easier to compare the values. This linear representation (length of bars) is more intuitive and allows for more accurate visual comparisons than the angular or area-based representations of a pie chart.


## 4. Imbalance ratio

To assess the imbalance more precisely, one common approach is to calculate the imbalance ratio: the number of instances of the majority class divided by the number of instances of the minority class.
As a rough rule of thumb, an imbalance ratio below 5:1 is considered mild to moderate, while ratios above 10:1 are considered severe.
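
The ratio is simple to compute from the class counts. The sketch below uses only the standard library. The function name `imbalance_ratio` and the example labels are made up for illustration.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return the majority-class count divided by the minority-class count."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes to compute a ratio")
    return max(counts.values()) / min(counts.values())


# Made-up labels: 90 negatives and 10 positives
example_labels = ["neg"] * 90 + ["pos"] * 10
ratio = imbalance_ratio(example_labels)
print(f"Imbalance ratio: {ratio:.1f}:1")  # Imbalance ratio: 9.0:1
```

With a pandas Series, the same number can be obtained from `counts = series.value_counts()` as `counts.max() / counts.min()`.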

