# Data Cleanup and Report Generation

We're going to clean up some data! The data here is from a real concert in 2016 (one of our products, [GuestDay](http://guestday.com), is used for registering guests at events). The data has been anonymised and lightly cleaned up, but it is still in the original raw format, which is what our app provides after registration.

During this event, the event planners wanted to know the number of attendees for each day, split into 3 categories. We could have done this in Excel by building pivot tables each day, but it made more sense to create a Jupyter notebook that does it automatically. Here, we'll load the raw data and generate the table at the end of the notebook.

```python
import pandas as pd
import warnings

# Silence library warnings so the notebook output stays readable
warnings.filterwarnings('ignore')
```

```python
# Load the raw registration export and preview the first few rows
df = pd.read_csv("concert.csv")
df.head()
```

### <font color="red">Exercise 1: Drop some columns</font>

We don't need all the columns. Let's get rid of some, and save the new DataFrame (or do it in-place). Remove the following columns:

* Salutation
* Name/Prefix
* Table
* `type` and every column after it

Google: [how to drop columns in pandas dataframes](https://encrypted.google.com/search?q=how+to+drop+columns+in+pandas+dataframes)
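If you get stuck, here is a minimal sketch of the idea on a toy DataFrame (not the concert data). The `First Name` and `extra` columns are made up for illustration, and the real file will have more columns. It uses `DataFrame.drop` for the named columns, and `Index.get_loc` to find where `type` sits so that everything from there onwards can be dropped too.

```python
import pandas as pd

# Toy data: "First Name" and "extra" are hypothetical column names
toy = pd.DataFrame({
    "Salutation": ["Mr", "Ms"],
    "First Name": ["Ann", "Bob"],
    "Name/Prefix": ["", ""],
    "Table": [1, 2],
    "type": ["a", "b"],
    "extra": [0, 1],
})

# Every column from "type" to the right-hand end of the frame
start = toy.columns.get_loc("type")
tail = list(toy.columns[start:])

# drop() returns a new DataFrame; the original is left untouched
trimmed = toy.drop(columns=["Salutation", "Name/Prefix", "Table"] + tail)
print(trimmed.columns.tolist())
```

Passing `inplace=True` to `drop` would modify `toy` directly instead of returning a new frame.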

### <font color="red">Exercise 2: Fix the names</font>

* Combine the first and last name into a single field
* Move it to the front (leftmost column)
* Drop the first and last name fields

Google: Concatenating strings in Python, create new column in pandas, reorder columns in pandas
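One possible approach, sketched on toy data: the column names `First Name`, `Last Name`, `Booking Ref` and the new `Name` column are assumptions, so check `df.columns` for the real ones. String columns can be joined with `+`, and selecting with a list of column names reorders the frame.

```python
import pandas as pd

# Toy data with hypothetical column names
toy = pd.DataFrame({
    "Booking Ref": ["A1", "B2"],
    "First Name": ["Ann", "Bob"],
    "Last Name": ["Lee", "Ng"],
})

# Element-wise string concatenation creates the combined field
toy["Name"] = toy["First Name"] + " " + toy["Last Name"]

# The separate name fields are no longer needed
toy = toy.drop(columns=["First Name", "Last Name"])

# Put "Name" first, followed by all the other columns in their existing order
toy = toy[["Name"] + [c for c in toy.columns if c != "Name"]]
print(toy)
```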

### <font color="red">Exercise 3: Drop duplicates</font>

There was an error in the data: some **booking ref**s appear more than once. Keep one row per booking ref and drop the rest. (It doesn't matter which of the duplicates you keep.) Can you drop them and end up with 767 entries?
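As a hint, pandas has `DataFrame.drop_duplicates`, which can look at a subset of columns when deciding what counts as a duplicate. A small sketch on toy data, assuming the column is called `Booking Ref` (the real name may differ):

```python
import pandas as pd

# Toy data: booking ref "A1" appears twice
toy = pd.DataFrame({
    "Booking Ref": ["A1", "A1", "B2"],
    "Name": ["Ann Lee", "Ann Lee", "Bob Ng"],
})

# Keep the first row for each booking ref and drop the later repeats
deduped = toy.drop_duplicates(subset="Booking Ref", keep="first")
print(len(deduped))
```

On the concert data, `len(df)` after this step is what you would compare against 767.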

### <font color="red">Exercise 4: Re-cut the data</font>

Out of all the "affiliation" columns, we only care about 3 types of people: `Premium`, `Advance`, and `General`. Anyone whose affiliation contains the term `Premium` should be categorised as `Premium`; of the rest, anyone who has an `Advance` ticket should be categorised as `Advance`; everyone else is `General`.

Make a new column called `Category` that only contains this type of categorisation: `Premium`, `Advance`, or `General`.

To do this, you'll want a `str` function we haven't seen so far. This [Google query](https://encrypted.google.com/search?q=dataframe+string+contains) should help.
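Here is a rough sketch of the precedence rule on toy data. It assumes a single text column named `Affiliation`, which is a guess (the real data may spread this over several columns). `Series.str.contains` returns a boolean mask, and `na=False` treats missing values as "no match".

```python
import pandas as pd

# Toy data with a hypothetical "Affiliation" column
toy = pd.DataFrame({
    "Affiliation": ["Premium Guest", "Advance Ticket", "Walk-in",
                    "Premium Advance", None],
})

is_premium = toy["Affiliation"].str.contains("Premium", na=False)
is_advance = toy["Affiliation"].str.contains("Advance", na=False)

# Start with the fallback, then overwrite in increasing order of priority
toy["Category"] = "General"
toy.loc[is_advance, "Category"] = "Advance"
toy.loc[is_premium, "Category"] = "Premium"  # applied last, so Premium wins
print(toy["Category"].tolist())
```

Assigning `Premium` last is what makes a row containing both terms end up as `Premium`.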

To check that you got it right, `df["Category"].describe()` should give the following:

### <font color="red">Exercise 5: Summarise the data</font>

Get a summary of the totals of the 3 types of ticket holders, counted by attendance.
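One way to do it, sketched on toy data with `groupby`. The `Attendance` column and its values (`Attended`, `No-show`) are invented for this example, so the real attendance field will likely look different.

```python
import pandas as pd

# Toy data: "Attendance" and its values are hypothetical
toy = pd.DataFrame({
    "Category": ["Premium", "Advance", "General", "General", "Premium"],
    "Attendance": ["Attended", "Attended", "No-show", "Attended", "No-show"],
})

# Count rows per (Category, Attendance) pair, then pivot Attendance into columns
summary = (
    toy.groupby(["Category", "Attendance"])
       .size()
       .unstack(fill_value=0)
)
print(summary)
```

`pd.crosstab(toy["Category"], toy["Attendance"])` or `pivot_table` with `aggfunc="size"` are alternative routes to the same kind of table.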

Things to Google for: groupby, or pivot_table (but there are other ways to do this!). Sample results below: