# Session 1 - Data Science Project Intro

Python Notebook [Colab/Jupyter] includes:
*   Familiarising participants with Colab
*   Coding best practices - good variable names, inline comments
*   Pandas and NumPy intro - benefits, differences between arrays and lists, the computational merits of each, when to use which, etc.
*   Data overview - data dimension, column types, scope

# Coding Best Practices

1.  **Use descriptive and meaningful variable names:** Avoid using generic variable names like x, y, and z. Instead, choose names that are descriptive and indicate the purpose or content of the variable or function. This makes the code more readable and easier to understand.

2.  **Use lowercase letters:** In Python, variable and function names are conventionally written in lowercase (this is the PEP 8 style guide recommendation). The convention is widely used and makes it easy to distinguish variables and functions from classes.

3.  **Use underscores for readability:** Use underscores to separate words in a variable or function name. This makes the name more readable and easier to understand.

4.  **Avoid using reserved keywords:** Python's reserved keywords, such as "if", "else", "while", and "for", cannot be used as variable or function names; trying to do so raises a SyntaxError.

5.  **Use camel case for class names:** In Python, class names should start with a capital letter and capitalize each following word, with no underscores (often called CapWords, PascalCase, or upper camel case). For example, MyClass. The capital letters mark the individual words within a compound name.

6.  **Be consistent:** Be consistent with naming conventions throughout your code. If you choose to use underscores to separate words in variable names, use the same convention throughout your code.

7.  **Avoid abbreviations:** Avoid using abbreviations unless they are widely understood in the context of your code. Abbreviations can make code harder to understand.

8.  **Keep function names concise:** Keep function names concise and to the point. A good function name should describe what the function does in a single word or phrase.

9.  **Use singular nouns for variables:** Use singular nouns for variables. For example, use "customer" instead of "customers" if you are referring to a single customer.

10.  **Inline comments:** **Comment code using # for easier readability**. Use comments to explain your process and reasoning to track your work and for others to follow along.
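The conventions above can be seen together in a short sketch. The class, function, and variable names here are hypothetical and chosen only for illustration; they are not part of the workshop dataset or notebook.

```python
# Illustrative names only; none of these come from the workshop notebook.

class TrackSummary:  # class name: capitalized words, no underscores
    """Hold the title and duration of a single track."""

    def __init__(self, title, duration_ms):
        self.title = title
        self.duration_ms = duration_ms

    def duration_minutes(self):
        # Convert milliseconds to minutes (60,000 ms per minute)
        return self.duration_ms / 60_000


def average_duration_minutes(tracks):
    """Return the mean duration, in minutes, of a list of TrackSummary objects."""
    total_minutes = sum(track.duration_minutes() for track in tracks)
    return total_minutes / len(tracks)


# Plural name for the collection, singular name ("track") for each item
tracks = [
    TrackSummary("First Song", 180_000),
    TrackSummary("Second Song", 240_000),
]
print(average_duration_minutes(tracks))  # 3.5
```

Compare this with a version that uses `x`, `y`, and `f`: the logic would be identical, but a reader would have to work out what each name holds.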

# Introduction to Pandas and NumPy

**Pandas:**

*  High-level data manipulation library for Python, designed for handling structured data (e.g., tables)
*  Provides DataFrame and Series data structures for flexible data handling
*  Offers built-in functions for data cleaning, aggregation, transformation, and visualization

**NumPy:**

*  Core library for numerical computing in Python, optimized for working with multi-dimensional arrays (ndarrays)
*  Offers efficient array operations, broadcasting, and mathematical functions
*  Widely used as a foundation for other scientific computing libraries
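As a rough illustration of the two libraries, the sketch below builds a small ndarray and a small DataFrame. The values and the names `tempo_values` and `songs` are made up for this example.

```python
import numpy as np
import pandas as pd

# A NumPy ndarray: a multi-dimensional grid of numbers with one data type
tempo_values = np.array([[120.0, 98.5], [140.2, 76.0]])
print(tempo_values.shape)   # (2, 2): two rows, two columns
print(tempo_values.mean())  # mathematical functions work on the whole array

# A pandas DataFrame: labeled columns that can each hold a different type
songs = pd.DataFrame({
    "song_title": ["Song A", "Song B"],
    "tempo": [120.0, 98.5],
})

# Selecting a single column returns a Series
tempo_series = songs["tempo"]
print(type(tempo_series).__name__)
```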

# Arrays, Lists, and Usage

**Differences between arrays and lists:**

*   Arrays (NumPy ndarrays) are homogeneous (same data type), while lists can hold mixed data types.
*   Arrays are more memory-efficient and faster for numerical operations than lists.
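A minimal sketch of the homogeneity difference, using made-up values. The exact integer dtype NumPy picks can depend on the platform, so no specific output is claimed for it here.

```python
import numpy as np

# A list keeps each item's own type
mixed_list = [1, "two", 3.0]
print([type(item).__name__ for item in mixed_list])

# An array stores every element with a single dtype
int_array = np.array([1, 2, 3])
print(int_array.dtype)

# Mixing integers and floats converts the whole array to float
upcast_array = np.array([1, 2.5, 3])
print(upcast_array.dtype)  # float64
```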

**Computational merits of arrays:**

*   Arrays provide vectorized operations, enabling element-wise calculations without explicit loops.
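*   NumPy's core is implemented in compiled code (C, with some linear algebra routines backed by Fortran-derived libraries), which is what makes these operations fast. Some operations, such as large matrix products, may also run in parallel, depending on the linear algebra library NumPy was built with.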
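To make "vectorized" concrete, the sketch below converts three durations from milliseconds to minutes, first with a plain Python loop and then with a single array expression. The variable names are illustrative.

```python
import numpy as np

durations_ms = [204600, 326933, 185707]

# Plain Python: an explicit loop (written here as a list comprehension)
durations_min_list = [ms / 60000 for ms in durations_ms]

# NumPy: the division is applied to every element in one expression
durations_min_array = np.array(durations_ms) / 60000

# Both approaches give the same numbers
print(np.allclose(durations_min_list, durations_min_array))  # True
```

On three values the speed difference is negligible; the benefit of the array form shows up as the data grows.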

**Computational merits of lists:**

*  Lists are more flexible in terms of data types and resizing, allowing for diverse data storage.
*  Built-in Python data structure, no additional library import required.
*  Use Python lists for small datasets or general-purpose programming where flexibility and simplicity are more important.


**Computational merits of NumPy vs Pandas:**

*  Use **NumPy** arrays for numerical operations, large datasets, or when performance is critical.
*  Use **Pandas** for handling structured data, complex data manipulation, or when working with mixed data types. Data in Pandas is usually held in a two-dimensional tabular structure with labeled rows and columns, similar to a table. This structure is called a DataFrame.
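The two libraries are often used together. As a small sketch with a hand-built two-row table (the name `songs` is an assumption for this example), a DataFrame can hold text and numbers side by side, and its numeric columns can be handed to NumPy with `to_numpy()`:

```python
import pandas as pd

# Mixed data types: two text columns and two numeric columns
songs = pd.DataFrame({
    "song_title": ["Mask Off", "Redbone"],
    "artist": ["Future", "Childish Gambino"],
    "danceability": [0.833, 0.743],
    "energy": [0.434, 0.359],
})
print(songs.dtypes)

# Pull out only the numeric columns as a NumPy array for numerical work
numeric_values = songs[["danceability", "energy"]].to_numpy()
print(numeric_values.shape)  # (2, 2)
```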


# Load the data
Upload a copy of the dataset from your local machine into the session folders. The data will only stay in the notebook session for the duration of the session and you will need to do this each time if you follow this method.

OR

There are other methods you can use. For example, you can mount a Google Drive to your notebook in Colab or you can read the file from a URL.

If you are using Jupyter Notebook, the methods will vary slightly, but you can still read the file from a URL or from a local path.

In [None]:
# Import statements for the libraries needed
import pandas as pd
import numpy as np

In [None]:
# Read in the data from uploaded file to our session
df = pd.read_csv("data.csv")

In [None]:
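# Read the file from a URL (the URL must point directly to a CSV file)
# Note: the Kaggle link below is the dataset's web page, not a CSV file,
# so read_csv cannot parse it as-is; download the CSV from that page instead
#df = pd.read_csv("https://www.kaggle.com/datasets/geomack/spotifyclassification")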
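A CSV that was saved together with its row index gains an extra column called `Unnamed: 0` when it is read back. The sketch below reproduces this with a tiny in-memory CSV (a stand-in, not the workshop file) and shows the optional `index_col=0` argument, which treats that first column as the row index instead. Whether to use it for this dataset is a judgment call.

```python
import io
import pandas as pd

# Stand-in for a small CSV file that was saved with its row index
csv_text = (
    ",acousticness,danceability\n"
    "0,0.0102,0.833\n"
    "1,0.199,0.743\n"
)

# Default read: the unnamed first column becomes "Unnamed: 0"
df_default = pd.read_csv(io.StringIO(csv_text))
print(df_default.columns.tolist())

# index_col=0 uses that first column as the row index instead
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df_indexed.columns.tolist())
```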


# Data Overview
In Pandas, **info() is a method that can be called on a DataFrame to provide a concise summary of the data contained within the DataFrame.** The info() method provides information about:
*   The total number of entries (rows) in the DataFrame.
*   The data type of each column.
*   The number of non-null values in each column.
*   The amount of memory used by the DataFrame.

This method is particularly useful for understanding the structure of a DataFrame and identifying potential data quality issues, such as missing values or incorrect data types. It is also helpful in identifying memory usage and optimization opportunities.
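As a small sketch of how info() exposes missing values, the hypothetical DataFrame below has one missing tempo. info() normally prints straight to the screen; passing `buf` is used here only so the text can be captured in a variable.

```python
import io
import numpy as np
import pandas as pd

songs = pd.DataFrame({
    "song_title": ["Song A", "Song B", "Song C"],
    "tempo": [120.0, np.nan, 98.5],  # one missing value
    "key": [2, 1, 5],
})

# Capture the summary text instead of printing it directly
buffer = io.StringIO()
songs.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)
```

In the printed summary, the tempo column reports 2 non-null values out of 3 entries, which is the signal that one value is missing.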


In Pandas, **the describe() method is used to generate descriptive statistics of a DataFrame or a Series.** It provides a summary of the central tendency, dispersion, and shape of the distribution of a dataset.

**When you call describe() on a DataFrame or Series, it computes and returns the following statistics:**
*   Count: the number of non-null values in each column.
*   Mean: the average value of each column.
*   Standard deviation: a measure of how much the values in a column vary from the mean.
*   Minimum and maximum values: the lowest and highest values in each column.
*   Quartiles: 25%, 50%, and 75% percentiles of the data.

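By default, describe() summarizes only the numeric columns of a DataFrame. For non-numeric columns (for example, when called with include="object" or include="all"), it reports the count, the number of unique values, the most frequent value, and that value's frequency instead. The summary can also be customized by selecting a subset of columns first, or by passing the percentiles parameter to change which percentiles are reported.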

Overall, **the describe() method is a useful tool for quickly getting an overview of the distribution and range of values in a DataFrame or Series.**
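The behaviour described above can be tried on a small made-up table. The DataFrame `songs` and its values are assumptions for this sketch.

```python
import pandas as pd

songs = pd.DataFrame({
    "artist": ["Future", "Drake", "Drake", "Omega"],
    "tempo": [150.0, 85.0, 80.0, 145.0],
})

# Default: only the numeric column (tempo) is summarized
numeric_summary = songs.describe()
print(numeric_summary)

# Ask for the 10th and 90th percentiles instead of the quartiles
custom_summary = songs.describe(percentiles=[0.1, 0.9])
print(custom_summary.index.tolist())

# On a text column, describe() reports count, unique, top, and freq
artist_summary = songs["artist"].describe()
print(artist_summary)
```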

# What does our data look like?

In [None]:
# Look at the first ten rows of the data
df.head(10)

Unnamed: 0.1,Unnamed: 0,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence,target,song_title,artist
0,0,0.0102,0.833,204600,0.434,0.0219,2,0.165,-8.795,1,0.431,150.062,4,0.286,1,Mask Off,Future
1,1,0.199,0.743,326933,0.359,0.00611,1,0.137,-10.401,1,0.0794,160.083,4,0.588,1,Redbone,Childish Gambino
2,2,0.0344,0.838,185707,0.412,0.000234,2,0.159,-7.148,1,0.289,75.044,4,0.173,1,Xanny Family,Future
3,3,0.604,0.494,199413,0.338,0.51,5,0.0922,-15.236,1,0.0261,86.468,4,0.23,1,Master Of None,Beach House
4,4,0.18,0.678,392893,0.561,0.512,5,0.439,-11.648,0,0.0694,174.004,4,0.904,1,Parallel Lines,Junior Boys
5,5,0.00479,0.804,251333,0.56,0.0,8,0.164,-6.682,1,0.185,85.023,4,0.264,1,Sneakin’,Drake
6,6,0.0145,0.739,241400,0.472,7e-06,1,0.207,-11.204,1,0.156,80.03,4,0.308,1,Childs Play,Drake
7,7,0.0202,0.266,349667,0.348,0.664,10,0.16,-11.609,0,0.0371,144.154,4,0.393,1,Gyöngyhajú lány,Omega
8,8,0.0481,0.603,202853,0.944,0.0,11,0.342,-3.626,0,0.347,130.035,4,0.398,1,I've Seen Footage,Death Grips
9,9,0.00208,0.836,226840,0.603,0.0,7,0.571,-7.792,1,0.237,99.994,4,0.386,1,Digital Animal,Honey Claws


In [None]:
# View the shape of the data
# as a tuple: (number of rows, number of columns)
df.shape

(2017, 17)

In [None]:
# info() shows a summary of the DataFrame
# Here you can view the data type and the non-null count of each column
# and check whether any column contains null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2017 entries, 0 to 2016
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Unnamed: 0        2017 non-null   int64  
 1   acousticness      2017 non-null   float64
 2   danceability      2017 non-null   float64
 3   duration_ms       2017 non-null   int64  
 4   energy            2017 non-null   float64
 5   instrumentalness  2017 non-null   float64
 6   key               2017 non-null   int64  
 7   liveness          2017 non-null   float64
 8   loudness          2017 non-null   float64
 9   mode              2017 non-null   int64  
 10  speechiness       2017 non-null   float64
 11  tempo             2017 non-null   float64
 12  time_signature    2017 non-null   int64  
 13  valence           2017 non-null   float64
 14  target            2017 non-null   int64  
 15  song_title        2017 non-null   object 
 16  artist            2017 non-null   object 


In [None]:
# See the summary statistics of the data
# By default, describe() summarizes only the numeric columns
# It shows the count, mean, standard deviation, min, max, and quartiles of each column
df.describe()

Unnamed: 0.1,Unnamed: 0,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence,target
count,2017.0,2017.0,2017.0,2017.0,2017.0,2017.0,2017.0,2017.0,2017.0,2017.0,2017.0,2017.0,2017.0,2017.0,2017.0
mean,1008.0,0.18759,0.618422,246306.2,0.681577,0.133286,5.342588,0.190844,-7.085624,0.612295,0.092664,121.603272,3.96827,0.496815,0.505702
std,582.402066,0.259989,0.161029,81981.81,0.210273,0.273162,3.64824,0.155453,3.761684,0.487347,0.089931,26.685604,0.255853,0.247195,0.500091
min,0.0,3e-06,0.122,16042.0,0.0148,0.0,0.0,0.0188,-33.097,0.0,0.0231,47.859,1.0,0.0348,0.0
25%,504.0,0.00963,0.514,200015.0,0.563,0.0,2.0,0.0923,-8.394,0.0,0.0375,100.189,4.0,0.295,0.0
50%,1008.0,0.0633,0.631,229261.0,0.715,7.6e-05,6.0,0.127,-6.248,1.0,0.0549,121.427,4.0,0.492,1.0
75%,1512.0,0.265,0.738,270333.0,0.846,0.054,9.0,0.247,-4.746,1.0,0.108,137.849,4.0,0.691,1.0
max,2016.0,0.995,0.984,1004627.0,0.998,0.976,11.0,0.969,-0.307,1.0,0.816,219.331,5.0,0.992,1.0


In [None]:
# Check whether there are any null (NaN) values
df.isnull().values.any()

False

In [None]:
# Alternatively, count the null values with sum()
# The first sum() counts nulls per column; the second adds those counts into one total
df.isnull().sum().sum()

0

If there are null values:

1.   Analyze the pattern: Investigate the reason behind null values to determine if they are random or systematic, which may affect the chosen method.
2.   Remove missing data: In cases of limited null values, dropping rows or columns with missing data can be a viable solution, using functions like dropna() in pandas.
3.   Impute values: Replace missing data with estimated values based on available data, such as mean, median, or mode imputation, or more advanced techniques like k-Nearest Neighbors (KNN) or regression imputation.
4.   Use categorical placeholders: For categorical variables, consider introducing a new category (e.g., "Unknown") to represent missing data.
5.   Incorporate uncertainty: Utilize probabilistic models or Bayesian techniques to account for uncertainty arising from missing data, allowing for more robust analyses and conclusions.
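The workshop dataset has no null values, so the simpler options above are sketched here on a small made-up table. The column names mirror the dataset, but the rows and the choice of median imputation are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

songs = pd.DataFrame({
    "artist": ["Future", None, "Drake", "Omega"],
    "tempo": [150.0, 85.0, np.nan, 145.0],
})

# Count the missing values in each column
print(songs.isnull().sum())

# Option: drop every row that has at least one missing value
rows_dropped = songs.dropna()

# Option: impute instead of dropping (work on a copy to keep the original)
songs_filled = songs.copy()
# Numeric column: fill with the median of the observed values
songs_filled["tempo"] = songs_filled["tempo"].fillna(songs_filled["tempo"].median())
# Categorical column: fill with a placeholder category
songs_filled["artist"] = songs_filled["artist"].fillna("Unknown")

print(rows_dropped)
print(songs_filled)
```

Dropping leaves two of the four rows, while imputing keeps all four; which trade-off is acceptable depends on why the values are missing.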

In [None]:
# Count the fully duplicated rows in the data
df.duplicated().sum()

0

If there are duplicated rows:

1. Check whether the repeated rows are genuinely distinct records that happen to share the same values. If they are, and they are relevant to the analysis, keep them.

2. If the rows are true duplicates of the same record, drop them to avoid bias in the analysis.
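Both checks can be sketched on a small made-up table in which one row is an exact copy and another only shares a song title. The data and names are hypothetical.

```python
import pandas as pd

songs = pd.DataFrame({
    "song_title": ["Song A", "Song A", "Song B", "Song B"],
    "artist": ["Artist X", "Artist X", "Artist Y", "Artist Z"],
})

# Count rows that exactly repeat an earlier row
full_duplicates = songs.duplicated().sum()
print(full_duplicates)  # 1

# keep=False marks every copy, which helps when inspecting them side by side
print(songs[songs.duplicated(keep=False)])

# Same title but a different artist: repeated in one column, not a full duplicate
title_repeats = songs.duplicated(subset=["song_title"]).sum()
print(title_repeats)  # 2

# Drop exact duplicates, keeping the first occurrence of each
deduplicated = songs.drop_duplicates().reset_index(drop=True)
print(deduplicated)
```

Here the two "Song B" rows survive because they differ in artist, which matches the first point above: rows that share some values are not necessarily duplicates.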



# Next Steps
Now that we have loaded and viewed the data to get a sense of what we are working with, it's time to explore the data further for insight. In the next session, we will learn basic Exploratory Data Analysis (EDA) techniques and some visualization.


# Save Notebook

If you need a copy of the notebook locally for future use or to use in a new notebook, you must save a .ipynb file.
