# Getting to know your dataset

### Data Understanding

The goal of the Business Understanding step of the Data Science Process is to develop a clear statement of the question or problem you are trying to solve. Once you have clarified that problem, you're ready to dig into your data in the **Data Understanding** step.

The Data Understanding step of the Data Science Process involves building an initial understanding of your data by exploring and visualizing your data. We'll address visualizing next week.

This week, we'll discuss your initial exploration of the data with the pandas library.

That initial exploration includes:

1. Collecting and loading the data:
- Gather the data you need from relevant sources, such as databases, files, or APIs. 
- Load the data into a pandas DataFrame

2. Exploring the data: 
- Look at the organization of your data, such as 
	- the number of rows and columns
	- variable names
	- data types.
- Familiarize yourself with the data by looking at a few rows or records.
- This will give you an idea of the size and complexity of the dataset and is an important initial sanity check.


3. Summarizing the data: 
- Generate basic statistics (such as counts, averages, and percentages) for each variable or feature.
- This will help you understand the distribution of the data and identify potential issues, like outliers or missing values.


Next week, we'll work on creating visualizations of the data with the matplotlib library.

### 1.1 Loading the data

In this section, we'll see how to read data into a DataFrame from CSV (comma-separated values) files, one of the most common data formats you'll encounter. 


Imagine your boss at a talent agency emails you the `spotify-dataset.csv` file. The file is a dataset of songs and their characteristics that she downloaded from Spotify. She also sends a data dictionary so you can reference the variables.

The data dictionary:

- `title`: name of the track
- `artist`: name of the artist
- `year`: release year of the track
- `bpm`: beats per minute; the tempo of the song
- `en`: energy; the higher the value, the more energetic the song
- `dnce`: danceability; the higher the value, the easier it is to dance to the song
- `loud`: loudness; the higher the value, the louder the song
- `val`: valence; the higher the value, the more positive the mood of the song
- `dur`: duration; the length of the song
- `acous`: acousticness; the higher the value, the more acoustic the song
- `speech`: speechiness; the higher the value, the more spoken words the song contains
- `pop`: popularity; the higher the value, the more popular the song

#### Viewing Datasets

To import the `spotify-dataset.csv` file as a DataFrame, we'll use the pandas `read_csv()` function like this:

`dataframe_name = pd.read_csv(filename)`

The `read_csv()` function reads the file specified by `filename`, parses the data in that file, and stores it in a DataFrame. Let's import the CSV file:


In [None]:
import pandas as pd

# read the CSV file into a DataFrame
music = pd.read_csv("../datasets/spotify-dataset.csv")

# display the DataFrame 
music
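
If the CSV file is not at hand, the same call can be tried on a small in-memory stand-in. The sketch below is only an illustration: the `csv_text` contents and the `sample` name are made up and are not rows from the Spotify file. It relies on the fact that `read_csv()` accepts a file-like object as well as a path.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file; these rows are invented for illustration
csv_text = """title,artist,year,bpm
Song A,Artist X,2010,120
Song B,Artist Y,2015,95
Song C,Artist X,2019,140
"""

# read_csv accepts a file path or a file-like object such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))

# the first line of the text became the column names
print(sample)
print(sample.shape)  # (3, 4)
```

The header row supplies the column names, and each following line becomes one row of the DataFrame.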

### 1.2 Exploring the data

#### Checking the dimensions

- Checking a .csv file right after you import it helps you catch discrepancies in the imported data early.
- This precaution keeps you from starting an analysis and later getting unexpected results because the data was incomplete or the file did not follow the standard comma-separated values format.

- A good place to start is the `.shape` attribute, which lets you quickly check the dimensions (number of rows and columns) of your DataFrame.
- This can help you confirm that you've loaded your data correctly. For example, if you were expecting to load a dataset with 100 rows and 5 columns, and `df.shape` returns `(100, 5)`, that's a good sign that your data has been loaded correctly. If it returns something different, there may have been an error when importing the file, or the file's structure may not be what you expected.
- It is also helpful when you deal with missing data:
    - If you remove or filter rows with missing data, checking `df.shape` again tells you how many rows were removed.

Here is how you use it:

`df_name.shape` 

NOTE: `.shape` is an attribute, not a method, so you don't use parentheses with it: `df.shape`, not `df.shape()`.


Let’s take a look at our music dataset:

In [None]:
# check the dimensions of your DataFrame
music.shape
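
The missing-data check described above can be sketched on a toy DataFrame. The `toy` data is hypothetical, and `dropna()` is just one common way to remove rows with missing values; your own cleaning step may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; NaN marks a missing value
toy = pd.DataFrame({
    "title": ["Song A", "Song B", "Song C", "Song D"],
    "bpm": [120.0, np.nan, 140.0, 98.0],
})

# shape is a (rows, columns) tuple, so it can be unpacked
rows_before, n_cols = toy.shape

# drop every row that contains at least one missing value
cleaned = toy.dropna()
rows_after = cleaned.shape[0]

print(toy.shape)      # (4, 2)
print(cleaned.shape)  # (3, 2)
print(rows_before - rows_after, "row(s) removed")  # 1 row(s) removed
```

Comparing the two shapes shows that only the number of rows changed, which is what you would expect from dropping rows.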

#### Preview the DataFrame

After importing data from a .csv file, using the `head()` and `tail()` methods to look at the first few and last few rows of your DataFrame can help you visually confirm that your data has been loaded correctly. You can check things like:

- The column names are correct.
- The data in each column is of the right type (e.g., text, number).
- The data doesn't contain unexpected values or errors.

For example, suppose you're expecting a DataFrame with the columns 'Name', 'Age', and 'City'. After loading your DataFrame, you can use `df.head()` to quickly check that these columns exist and contain the data types you expect, i.e., 'Name' is text, 'Age' is numerical and 'City' is text.


Use the `head()` method like this to see the first 5 rows:
 
`df_name.head()`

You can also specify the number of rows: 

`df_name.head(n=number_of_rows)`


Similarly, the `tail()` method shows the last `n` rows of the DataFrame (5 by default):

`df_name.tail()`
 
Let's check the `music` dataset:

In [None]:
# return first five rows
music.head()

In [None]:
# return bottom five rows
music.tail()

In [None]:
# return first ten rows
music.head(n=10)
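
The 'Name', 'Age', and 'City' example from earlier can be sketched as follows. The `people` DataFrame and its values are invented for illustration. Alongside `head()` and `tail()`, the sketch uses the `.columns` and `.dtypes` attributes, which list the column names and data types directly instead of leaving you to read them off the preview.

```python
import pandas as pd

# Hypothetical data matching the 'Name', 'Age', 'City' example
people = pd.DataFrame({
    "Name": ["Ana", "Ben", "Chen", "Dee", "Eli", "Fay"],
    "Age": [34, 27, 45, 52, 19, 38],
    "City": ["Lima", "Oslo", "Taipei", "Cairo", "Perth", "Quito"],
})

first_two = people.head(n=2)  # first two rows
last_two = people.tail(n=2)   # last two rows

print(first_two)
print(last_two)

# confirm the column names and the data type of each column
print(list(people.columns))  # ['Name', 'Age', 'City']
print(people.dtypes)
```

Here 'Age' should show up with a numeric type, while 'Name' and 'City' hold text.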

### 1.3 Summarizing the DataFrame

- Imagine you've just received a big set of data and you want to get an overall idea of what it looks like.
- The `describe()` method is a one-stop-shop for getting a quick summary of your data.
- When you apply `describe()` to your dataset, it calculates several summary statistics at once for each numeric column:
    - the count (how many items there are)
    - the mean (the average)
    - the standard deviation (how spread out the data is)
    - the minimum and maximum values, and
    - the quartiles (which split the data into four equal parts).


You can apply `describe()` to the entire DataFrame like this:

`dataframe_name.describe()`

Or you can apply it to one variable like this:

`dataframe_name['column_name'].describe()`

Let's get a first-pass look at the `music` DataFrame with `describe()`:

In [None]:
# Summarize the DataFrame
music.describe()
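
To see what the individual statistics look like, here is a minimal sketch on a toy DataFrame. The `toy` data is hypothetical and chosen so the numbers are easy to check by hand. It also shows that, by default, `describe()` summarizes only the numeric columns of a DataFrame with mixed types.

```python
import pandas as pd

# Hypothetical toy data: one text column and one numeric column
toy = pd.DataFrame({
    "artist": ["X", "Y", "X", "Z", "Y"],
    "bpm": [100, 110, 120, 130, 140],
})

# on a mixed DataFrame, describe() keeps only numeric columns by default
summary = toy.describe()
print(list(summary.columns))  # ['bpm']

# describing a single column returns a Series indexed by statistic name
bpm_summary = toy["bpm"].describe()
print(bpm_summary["count"])  # 5.0
print(bpm_summary["mean"])   # 120.0
print(bpm_summary["25%"])    # 110.0
print(bpm_summary["50%"])    # 120.0
print(bpm_summary["75%"])    # 130.0
```

The `25%`, `50%`, and `75%` entries are the quartiles, and `50%` is the median.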