___

<a href='https://www.instagram.com/lanlearning/'> <img src='../pimages/logosmall.png' width="100" height="100"/></a>
___
<center>Copyright LanLearning 2020</center>





# Welcome back to Pandas! Part 3!

In this notebook, we will be working with another simple DataFrame and doing some more data manipulations! Yay!!

Here are some things you will learn:
- how to import a dataset from Kaggle
- how to clean part of the dataset
- how to manipulate the values in a dataset
- how to group your table by certain features/attributes
- how to put conditions on your DataFrame

In [None]:
# first things first,

import pandas as pd

## About the Data:
Just to give you a feel for the different domains within data science, we will use various types of datasets. 

Today's dataset is related to music. The file comes from Kaggle: https://www.kaggle.com/geomack/spotifyclassification. 

Let's get into it.

### If the cell below errors:
Copy the file path of the ```music_data``` file and paste it inside the parentheses in the cell below.

Once you ```pd.read_csv()``` the file ```music_data``` and store it under the variable name ```music```, you should be good to go! 
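If you are not sure whether the path is right, one way to check before reading is sketched below. The file name `music_data.csv` and its location next to the notebook are assumptions; swap in the path you copied.

```python
from pathlib import Path

import pandas as pd

# Assumed location of the file; replace with the path you copied.
csv_path = Path("music_data.csv")

if csv_path.exists():
    # The file was found, so it is safe to read it into a DataFrame.
    music = pd.read_csv(csv_path)
else:
    # Show the full path pandas would have looked at, to help spot typos.
    print(f"Could not find {csv_path.resolve()}; check the path and try again.")
```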

In [None]:
music = pd.read_csv('musics_data.csv')
music.head(10)

In [None]:
music = pd.read_csv('music_data.csv')
music.head()

## Exploratory Data Analysis
Before we go ahead and work with a dataset, we first need to explore it. This will give us enough context to help us perform meaningful data science techniques later on. 

There are a bunch of things you can do to explore the dataset that we haven't learned about. But we'll just use what we know for now. 

In [None]:
# shape of the frame: (number of rows, number of columns)

music.shape

#### What does this mean?

In [None]:
music.describe()

### What does ```.describe()``` do?

If you look at the output above, the row labels (the bolded column on the far left) name each statistic. 

```.describe()``` **describes** each of our columns that contain numbers. 

- **count**: the number of non-missing elements in the column (which should be the same as the number of rows when nothing is missing)
- **mean**: represents the mean of the column
- **std**: represents the standard deviation of the column
- **min**: represents the minimum of the column
- **25%, 50%, 75%**: these are the quartiles 
    - what this means is that if you order your data numerically, then the value shown at the \_\_% mark will be greater than or equal to roughly \_\_% of your data
- **max**: represents the maximum of the column
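To see these statistics on something small enough to check by hand, here is a minimal sketch on a made-up frame. The column names `tempo`, `energy`, and `title` and all the values are invented for illustration and are not taken from the music dataset.

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "tempo": [100.0, 120.0, 140.0, 160.0],
    "energy": [0.1, None, 0.3, 0.5],      # one missing value
    "title": ["a", "b", "c", "d"],        # text column, skipped by default
})

summary = toy.describe()
print(summary)

# Individual statistics can be looked up by row label and column name.
print(summary.loc["mean", "tempo"])     # 130.0
print(summary.loc["50%", "tempo"])      # 130.0 (the median)
print(summary.loc["count", "energy"])   # 3.0, the missing value is not counted
```

Notice that `title` does not appear in the summary, since it holds text rather than numbers.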

### Using Set

Let's use our knowledge from past Python lessons to answer these questions: 

#### Who are some of the artists in our dataset? How many unique artists are represented in our data?

In [None]:
# that is the 'artist' column:
music['artist']

That up there ^ is the series of the ```artist``` column. 

This dataset features artists such as **Drake**, **Future**, and **The Chainsmokers**.

<br>
<img src='../pimages/future.jpeg' width="600"/>
<br>

### ```.values```
To get the values in that column, use ```.values```.

In [None]:
music['artist'].values

*Notice* how this is an array. Specifically, a **numpy array**. Now we can work with this, but first, let's store this array in a variable so we can use it later:

In [None]:
artists = music['artist'].values

In [None]:
len(artists)

# this makes sense as there are 2017 rows in our data.

Remember from past Python notebooks that calling ```set()``` on an array or list removes the duplicates:

In [None]:
artists_set = set(artists)

In [None]:
# calling len() on artists_set will show us how many unique artists there are, right?

len(artists_set)

This means there are ```1343``` unique artists in our dataset. 

## A Shortcut

Another way to quickly find unique values is by using ```.unique()```, which will give you an array with no duplicate artists:

In [None]:
music['artist'].unique()

In [None]:
artists_using_unique = music['artist'].unique()

In [None]:
len(artists_using_unique)

It's the same length as above!
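Here is a small side-by-side sketch of the two approaches on a made-up frame (the artist list is invented for illustration). It also shows ```.nunique()```, a pandas method not used elsewhere in this notebook, which returns the count directly.

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({"artist": ["Drake", "Future", "Drake", "Disclosure", "Future"]})

via_set = len(set(toy["artist"].values))   # Python set removes duplicates
via_unique = len(toy["artist"].unique())   # pandas array of distinct values
via_nunique = toy["artist"].nunique()      # the count itself, no len() needed

print(via_set, via_unique, via_nunique)    # 3 3 3
```

One difference to keep in mind: by default ```.nunique()``` ignores missing values, while ```.unique()``` keeps them, so the two can disagree on columns with gaps.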

## How many times does Drake appear in the dataset? 

**Drake** is a ```value``` in our dataset. To see how many times he shows up, we can use a method called ```value_counts()```, which shows us the counts for each of the values in a column (which is ```artist``` in this scenario):

In [None]:
music['artist'].value_counts()

Drake appears 16 times in our frame, more than anyone else. 

Rick Ross appears 13 times. 

Disclosure appears 12 times in our frame above. 
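Rather than reading the numbers off the printed output, you can pull them out of the result, since ```value_counts()``` returns a Series indexed by the values it counted. A minimal sketch on made-up data (the rows below are invented, not taken from the music dataset):

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame(
    {"artist": ["Drake", "Future", "Drake", "Disclosure", "Drake", "Future"]}
)

counts = toy["artist"].value_counts()   # sorted from most to least frequent
print(counts)

print(counts["Drake"])    # 3, the count for one specific artist
print(counts.head(2))     # the two most frequent artists
print(counts.idxmax())    # Drake, the artist that appears most often
```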

## A More In-Depth Look:
Let's use our frame and take a more in-depth look at one of the artists ... let's choose **Disclosure**! 

<br>
<img src='../pimages/disclosure.jpg' width="600"/>
<br>

Let's take a look at all of our rows where Disclosure is the artist: 

Here are some things we know:
1. We know that 'Disclosure' will be in our 'artist' column
2. The 'artist' column is part of our ```music``` frame.

Essentially we will need to check for where ```artist``` ```==``` ```'Disclosure'``` in our frame. 

Take a look at the code first, and we'll go through the general process afterwards:

In [None]:
music[music['artist'] == 'Disclosure']

# == checks whether two things are equal (simple operators)
# music['artist'] uses column selection (from pandas2)

In [None]:
music['artist'] == 'Disclosure'

### Make Some Observations: 
Make sure that all the rows have ```'Disclosure'``` as the artist! 

Let's double-check that there are 12 rows (which is what we found earlier). 

In [None]:
music['artist'].value_counts()['Disclosure']

In [None]:
len(music[music['artist'] == 'Disclosure'])

### Let's Break Down the Code: 

We wrote ```music[music['artist'] == 'Disclosure']```. 

The conditional statement here is: ```music['artist'] == 'Disclosure'```. 

This checks, row by row, whether the value in the ```'artist'``` column of the ```music``` frame is equal (```==```) to ```'Disclosure'```. 

We can actually see what this returns: 

In [None]:
music['artist'] == 'Disclosure'

### A Series of Booleans:
The conditional statement on the pandas column (```music['artist'] == 'Disclosure'```) returned a boolean Series (as displayed above), which we then used to index our frame and get the rows we wanted (```music[music['artist'] == 'Disclosure']```). <br>


**We pass this Series of booleans (```True``` and ```False``` values based on our conditional statement) into the frame's square brackets so that it displays only the rows where the value is ```True```.**

### The General Form:
Use ```df``` to represent your DataFrame, ```col``` for the column name, and ```item``` for the value in that column that you want: 

Write this condition as: ```df[df[col] == item]```. 
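The two steps hidden inside that one line can be written out separately. This is a minimal sketch on a made-up frame; the `song` column and all the rows are invented for illustration.

```python
import pandas as pd

# Toy data, made up for illustration only.
df = pd.DataFrame({
    "artist": ["Drake", "Disclosure", "Drake", "Future"],
    "song": ["song_a", "song_b", "song_c", "song_d"],
})

col, item = "artist", "Drake"

# Step 1: the condition alone gives one True/False per row.
mask = df[col] == item
print(mask.tolist())   # [True, False, True, False]

# Step 2: indexing with the mask keeps only the rows marked True.
subset = df[mask]
print(subset)

# True counts as 1, so summing the mask gives the number of matching rows.
print(mask.sum())      # 2
```

Storing the mask in its own variable is optional, but it can make longer conditions easier to read and reuse.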

## Marshmello: 

<br>
<img src='../pimages/marsh.jpg' width="600"/>
<br>

Let's check if **Marshmello** is in our frame: 

In [None]:
music[music['artist'] == 'Marshmello']

### Combining Multiple Conditions:
Now let's slice into our DataFrame where we want not just one but **two** conditions to be met! 

#### Drake and Disclosure: 
For two conditions where you want both of them to be ```True```, you need to enclose each condition in parentheses and put a ```&``` in between.  Check it out: 

In [None]:
music[(music['artist'] == 'Drake') & (music['artist'] == 'Disclosure')]

Nothing is returned because we don't have an instance (in this case, a song), where the artist is both Drake and Disclosure.

#### Drake or Disclosure: 
For two conditions where you want at least one of them to be ```True```, enclose each condition in parentheses and put a ```|``` in between. Check it out: 

In [None]:
music[(music['artist'] == 'Drake') | (music['artist'] == 'Disclosure')]

In [None]:
# try writing some conditions of your own to look at different artists.

With `|`, songs by either Drake or Disclosure are included.
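When every condition checks the same column, chaining many ```|``` conditions gets long. One alternative, not covered in this notebook, is the pandas method ```.isin()```, which takes a list of accepted values. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy data, made up for illustration only.
toy = pd.DataFrame({"artist": ["Drake", "Disclosure", "Future", "Drake"]})

# Two conditions joined with | ...
with_or = toy[(toy["artist"] == "Drake") | (toy["artist"] == "Disclosure")]

# ... and the same selection written with .isin()
with_isin = toy[toy["artist"].isin(["Drake", "Disclosure"])]

print(with_or.equals(with_isin))   # True
```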

<br>
<img src='../pimages/nipsey.jpeg' width="400"/>
<br>

## Recap 

- ```.describe()``` provides statistical insights (such as count, mean, std, min, max) about the columns in a dataframe whose values are numerical
- to find the number of unique values in a column within a dataframe, you can either use ```set()``` or ```unique()```:
    - ```len(set(df_name["column_name"].values))```
    - ```len(df_name["column_name"].unique())```
- use ```value_counts()``` to see how many times each value appears in a specific column, ex: ```df_name["column_name"].value_counts()```
- if you want to slice your dataframe to view only your desired portion, you should follow the written format ```df[df["col"] == "item"]```, which will give you the parts of the original dataframe that include the name ```item``` in the column ```col```
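- if you want to slice your dataframe to view only the portion of it that meets two separate requirements, you should follow the format ```df[(df["col_1"] == "item_1") & (df["col_2"] == "item_2")]``` (using the same column for both conditions, as in the Drake and Disclosure example, returns nothing)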
- if you want to slice your dataframe to view only the portion of it that meets at least one of two separate requirements, you should follow the format ```df[(df["col"] == "item_1") | (df["col"] == "item_2")]```

# End
In the next notebook, we'll add a column to our ```music``` frame.

### About this notebook: 
#### Developed by:
* [Milan Butani](https://www.linkedin.com/in/milanbutani/) 
* [Kyra Yee](https://www.linkedin.com/in/kyrayee/)
* [Jacqueline Mei](https://www.linkedin.com/in/jacqueline-mei-9140401aa/)
* [Liam McDonough](https://www.linkedin.com/in/liammmcdonough/)
* [Amy Tran](https://www.linkedin.com/in/amytran2303/)

#### Connect with us:
<a href='https://www.linkedin.com/company/lanlearning/'> <img src=https://img.icons8.com/color/48/000000/linkedin.png width="48" height="48" align="left"/></a>

<a href='http://www.instagram.com/lanlearning'> <img src=https://img.icons8.com/fluent/48/000000/instagram-new.png width="48" height="48" align="left"/></a>

<a href='https://www.youtube.com/channel/UC5_yxU9pz4ka7xITJMxO5WA'> <img src=https://img.icons8.com/color/48/000000/youtube-squared.png width="48" height="48" align="left"/></a>

<a href='https://www.github.com/lanlearning/'> <img src=https://img.icons8.com/material-rounded/48/000000/github.png/ width="48" height="48" align="left"/></a>


