# Data Preparation with Python: Practice Exercises

The objective of this practice session is to help you learn to apply commonly used data preparation steps to a raw dataset in Python.

Suppose it is November 2009 and you wish to sell your copy of Mario Kart for the Nintendo Wii on eBay. You are curious how much people are willing to pay for this particular game, so you scrape all the eBay auctions from the previous month.

### Data Overview - Mario Kart auctions from eBay

The data, collected in October 2009, has the following variables:

- **`id`**: Numeric, Auction ID assigned by eBay.

- **`duration`**: Numeric, Auction length, in days.

- **`n_bids`**: Numeric, Number of bids.

- **`cond`**: Binary, Game condition, either new or used.

- **`start_pr`**: Numeric, Start price of the auction.

- **`ship_pr`**: Numeric, Shipping price.

- **`total_pr`**: Numeric, Total price, which equals the auction price plus the shipping price.

- **`ship_sp`**: Categorical, Shipping speed or method.

- **`seller_rate`**: Numeric, The seller's rating on eBay. This is the number of positive ratings minus the number of negative ratings for the seller.

- **`stock_photo`**: Binary, Whether the auction feature photo was a stock photo or not. If the picture was used in many auctions, then it was called a stock photo.

- **`wheels`**: Numeric, Number of Wii wheels included in the auction. These are steering wheel attachments to make it seem as though you are actually driving in the game. When used with the controller, turning the wheel actually causes the character on screen to turn.

- **`title`**: String, The title of the auction.

The modified dataset is available as a CSV file named `mariokart.csv` and can be downloaded from [here](https://raw.githubusercontent.com/imranture/practice_stats/main/datasets/mariokart.csv). 

**Exercise 0**: Read in the dataset as `df` and display the first 10 rows. How many rows and columns are there?
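As a warm-up for this kind of task, here is a minimal sketch of reading a CSV and inspecting its size. It uses a small in-memory CSV as a hypothetical stand-in (the name `toy` and its values are made up for illustration); for the exercise itself you would pass the file path or URL of `mariokart.csv` to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file; the values are invented.
# pd.read_csv also accepts a local path or a URL in place of this buffer.
toy_csv = io.StringIO(
    "id,duration,cond\n"
    "1,3,new\n"
    "2,7,used\n"
    "3,5,used\n"
)

toy = pd.read_csv(toy_csv)

# head(n) returns at most n rows, so it is safe on small frames too.
print(toy.head(10))

# shape is a (rows, columns) tuple.
n_rows, n_cols = toy.shape
print(f"{n_rows} rows, {n_cols} columns")
```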

**Exercise 1**:  Display 5 randomly sampled rows and make sure that the variable types match the data descriptions outlined in the Data Overview above.

**Exercise 2**: Generate summary statistics of `df`. 

**Hint**: Use the `describe()` method with `include = np.number` and `include = object`.
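The sketch below shows how the `include` argument of `describe()` changes the summary, using a made-up frame named `toy` rather than the auction data. It passes the built-in `object` for text columns, since the `np.object` alias was removed in NumPy 1.24. The text column is created with an explicit `object` dtype so the example does not depend on how a particular pandas version infers text columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from mariokart.csv.
toy = pd.DataFrame({
    "price": [10.0, 20.0, 30.0, 40.0],
    "bids": [1, 5, 3, 7],
    "cond": pd.Series(["new", "used", "used", "new"], dtype=object),
})

# Numeric columns: count, mean, std, min, quartiles, max.
numeric_summary = toy.describe(include=np.number)
print(numeric_summary)

# Object columns: count, unique, top, freq.
text_summary = toy.describe(include=object)
print(text_summary)
```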

**Exercise 3**: Plot a histogram of `total_pr` with 30 bins to see the distribution. What do you notice?

**Exercise 4**: Are there any ID-like column(s)? If there are, remove them. Why should we remove ID-like columns?
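One possible way to look for identifier-like columns is to compare the number of unique values in each column with the number of rows. The sketch below does this on an invented frame called `toy`. Treat the count as a hint only, because a continuous feature can also be unique in every row.

```python
import pandas as pd

# Hypothetical toy data, not taken from mariokart.csv.
toy = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "duration": [3, 7, 3, 1],
    "total_pr": [51.5, 37.0, 51.5, 44.0],
})

# A column with as many unique values as rows is a candidate identifier.
unique_counts = toy.nunique()
print(unique_counts)
print("rows:", len(toy))

# drop returns a new frame; the original is left unchanged.
toy_no_id = toy.drop(columns=["id"])
print(toy_no_id.columns.tolist())
```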

**Exercise 5**: Are there any missing values?

**Exercise 6**: Identify the row(s) with missing values. 

**Exercise 7**: Remove the row(s) with missing values and make sure there are no missing values left.
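Exercises 5 to 7 follow a common pattern: count missing values, locate the affected rows, drop them, then verify. A minimal sketch of that pattern on an invented frame (`toy` and its values are assumptions made for illustration) could look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two deliberately missing entries.
toy = pd.DataFrame({
    "duration": [3, 7, np.nan, 1],
    "cond": ["new", None, "used", "used"],
    "total_pr": [51.5, 37.0, 45.5, 44.0],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Rows that contain at least one missing value.
rows_with_missing = toy[toy.isna().any(axis=1)]
print(rows_with_missing)

# Drop those rows and renumber the index.
toy_clean = toy.dropna().reset_index(drop=True)

# Total number of missing values left after the drop.
remaining = int(toy_clean.isna().sum().sum())
print("missing values left:", remaining)
```

Dropping rows is only one option; whether it is appropriate depends on how many rows are affected and why the values are missing.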

**Exercise 8**: `seller_rate` is the number of positive reviews minus the number of negative reviews. Discretize the numerical variable `seller_rate` into a categorical variable with 5 categories: *very low*, *low*, *medium*, *high* and *very high*.
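The exercise does not prescribe a binning rule. One option, sketched below on made-up ratings, is equal-frequency binning with `pd.qcut`, which places roughly the same number of rows in each category. Equal-width binning with `pd.cut` is an alternative, though it tends to leave bins nearly empty when the variable is heavily skewed.

```python
import pandas as pd

labels = ["very low", "low", "medium", "high", "very high"]

# Hypothetical, strongly skewed ratings invented for illustration.
toy = pd.DataFrame({
    "seller_rate": [0, 15, 40, 120, 300, 900, 2500, 8000, 40000, 270000]
})

# q=5 splits the values at the 20th, 40th, 60th and 80th percentiles.
toy["seller_rate_cat"] = pd.qcut(toy["seller_rate"], q=5, labels=labels)

print(toy)
print(toy["seller_rate_cat"].value_counts(sort=False))
```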

**Exercise 9**: Perform integer encoding for `seller_rate` such that *very low* is 0, *low* is 1, *medium* is 2, *high* is 3 and *very high* is 4.
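Integer encoding of an ordinal variable can be written as an explicit mapping, which keeps the intended order visible in the code. The sketch below applies such a mapping to an invented column of labels; the frame `toy` and the dictionary name `level_to_int` are assumptions for illustration.

```python
import pandas as pd

# Explicit label-to-integer mapping that preserves the ordinal order.
level_to_int = {"very low": 0, "low": 1, "medium": 2, "high": 3, "very high": 4}

# Hypothetical toy labels.
toy = pd.DataFrame({
    "seller_rate": ["low", "very high", "medium", "very low", "high"]
})

# map looks up each label in the dictionary.
toy["seller_rate_int"] = toy["seller_rate"].map(level_to_int)
print(toy)
```

If the column is already an ordered categorical whose categories are listed in this order, `.cat.codes` yields the same integers.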

**Exercise 10**: Use one hot encoding for the remaining categorical features. How many columns do we have now? Why?
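To see why one-hot encoding changes the column count, consider the small sketch below. It uses `pd.get_dummies` on an invented frame; the `ship_sp` values shown here are placeholders and may not match the levels in the real dataset.

```python
import pandas as pd

# Hypothetical toy data: one numeric column and two categorical columns.
toy = pd.DataFrame({
    "cond": ["new", "used", "used"],
    "ship_sp": ["standard", "priority", "media"],
    "total_pr": [51.5, 37.0, 45.5],
})

# Each listed column is replaced by one indicator column per distinct level.
encoded = pd.get_dummies(toy, columns=["cond", "ship_sp"])

print(encoded.columns.tolist())
print(encoded.shape)
```

Here `cond` has 2 levels and `ship_sp` has 3, so the 3 original columns become 1 + 2 + 3 = 6. For a binary feature, a single indicator column carries the same information; `pd.get_dummies` offers `drop_first=True` for that purpose.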