# Reading Files & Intro to Pandas

To use pandas, you'll typically start with the following line of code.

In [3]:
import pandas as pd

There are two core objects in pandas: the DataFrame and the Series.

## DataFrame

A DataFrame is a table. It contains an array of individual entries, each of which has a certain value. Each entry corresponds to a row (or record) and a column.

The pd.DataFrame() constructor generates these DataFrame objects. To declare a new one, pass a dictionary whose keys are the column names (David and Patty in this example) and whose values are lists of entries. This is the standard way of constructing a new DataFrame, and the one you are most likely to encounter.

In [4]:
pd.DataFrame({'David': [81, 93], 
              'Patty': [99, 91]
             })

Unnamed: 0,David,Patty
0,81,99
1,93,91


In this example, we assign an index to the dataframe so that each row has a readable label:

In [5]:
pd.DataFrame({'David': [81, 93], 
              'Patty': [99, 91]},
             index = ['Midterm', 'Final'])

Unnamed: 0,David,Patty
Midterm,81,99
Final,93,91
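Once the rows have labels, those labels can be used for lookups with .loc. A minimal sketch, reusing the scores above (the variable name scores is an assumption for illustration, not part of the original notebook):

```python
import pandas as pd

# Same data as above; "scores" is a hypothetical variable name
scores = pd.DataFrame({'David': [81, 93],
                       'Patty': [99, 91]},
                      index=['Midterm', 'Final'])

midterm_row = scores.loc['Midterm']         # one row, returned as a Series
patty_final = scores.loc['Final', 'Patty']  # one cell, selected by row label and column name

print(midterm_row)
print(patty_final)
```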


## Series

A Series, by contrast, is a sequence of data values. If a DataFrame is a table, a Series is a list. And in fact you can create one with nothing more than a list:

In [6]:
pd.Series([1, 2, 3, 4, 5])

0    1
1    2
2    3
3    4
4    5
dtype: int64

A Series is, in essence, a single column of a DataFrame. So you can assign row labels to the Series the same way as before, using an index parameter. However, a Series does not have a column name; it only has one overall name:

In [7]:
pd.Series([99, 91], index=['Midterm', 'Final'], name='Test Scores')

Midterm    99
Final      91
Name: Test Scores, dtype: int64
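To see the "single column" relationship in practice, here is a small sketch that selects one column from the earlier table (again using the hypothetical variable name scores). The result is a Series whose name is the column name and whose index is the row labels:

```python
import pandas as pd

# "scores" is a hypothetical name for the table built earlier
scores = pd.DataFrame({'David': [81, 93],
                       'Patty': [99, 91]},
                      index=['Midterm', 'Final'])

patty = scores['Patty']      # selecting one column returns a Series

print(type(patty))
print(patty.name)            # the Series name is the column name
print(patty.index.tolist())  # the row labels carry over
```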

# Reading Data Files

For this demonstration, we are going to be using a subset of a [Spotify Dataset on Kaggle](https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks?select=data.csv). I've gone ahead and reduced the number of records significantly, but you can download the full dataset from the link above.

## Operating System (OS)

The first thing to note is about directories. You can always pass the entire file path to the function that reads the file, but sometimes it's easier to just use the file name. A bare file name is resolved relative to your current working directory, so make sure you check that directory with the os module:

In [8]:
import os

os.getcwd()

'C:\\Users\\mfederighi\\Jupyter_WD\\Python Learning Group\\Intro to Pandas'

If you want to change your current working directory, you can do so as follows:

In [10]:
os.chdir('C:\\Users\\mfederighi\\Jupyter_WD\\Python Learning Group\\Intro to Pandas') # point this at the folder that holds the CSV files
         
os.getcwd()

'C:\\Users\\mfederighi\\Jupyter_WD\\Python Learning Group\\Intro to Pandas'
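Changing directories is not the only option. As a sketch, you can also build the full path yourself and pass that to the reader. The folder used here (the current working directory) is only an assumption about where the file lives:

```python
import os

data_dir = os.getcwd()  # assumption: the CSV sits in the current working directory
csv_path = os.path.join(data_dir, 'Spotify_2020.csv')  # joins with the right separator for your OS

print(csv_path)
print(os.path.exists(csv_path))  # False if the file is not actually there
```

Checking os.path.exists before reading gives a clearer signal than a FileNotFoundError raised from inside the reader.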

There are multiple ways to read files into Python, but for now we are only going to look at the pandas library.

## Read a Single CSV

Being able to create a DataFrame or Series by hand is handy. But, most of the time, we won't actually be creating our own data by hand. Instead, we'll be working with data that already exists.

Data can be stored in any of a number of different forms and formats. By far the most basic of these is the humble CSV file.  

In [11]:
spotify_2020 = pd.read_csv("Spotify_2020.csv")

spotify_2020.head() # head shows the first 5 rows by default. If you want more or fewer, put the number inside the parentheses

Unnamed: 0,valence,year,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo
0,0.756,2020,0.221,"['24kGoldn', 'iann dior']",0.7,140526,0.722,1,3tjFYV6RSFtuktYl3ZtYcq,0.0,7,0.272,-3.558,0,Mood (feat. iann dior),99,2020-07-24,0.0369,90.989
1,0.347,2020,0.114,"['Pop Smoke', 'Lil Baby', 'DaBaby']",0.823,190476,0.586,1,0PvFJmanyNQMseIFrU708S,0.0,6,0.193,-6.606,0,For The Night (feat. Lil Baby & DaBaby),95,2020-07-03,0.2,125.971
2,0.357,2020,0.0194,"['Cardi B', 'Megan Thee Stallion']",0.935,187541,0.454,1,4Oun2ylbjFKMPTiaSbbCih,0.0,1,0.0824,-7.509,1,WAP (feat. Megan Thee Stallion),96,2020-08-07,0.375,133.073
3,0.522,2020,0.244,"['Drake', 'Lil Durk']",0.761,261493,0.518,1,2SAqBLGA283SUiwJ3xOUVI,3.5e-05,0,0.107,-8.871,1,Laugh Now Cry Later (feat. Lil Durk),93,2020-08-14,0.134,133.976
4,0.682,2020,0.468,['Ariana Grande'],0.737,172325,0.802,1,35mvY5S1H3J2QZyna3TFe0,0.0,0,0.0931,-4.771,1,positions,96,2020-10-30,0.0878,144.015


There are many different parameters in the read_csv function, [all of which can be seen here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)

You can change the delimiter, skip rows, change the column names, etc. There are a lot of options to read in your data and format it appropriately. Here's an example of some of those parameters:

In [14]:
# Read the file, take only the first 3 columns, and rename the columns
# header = 0 tells pandas that the first line holds the original column names, so it is replaced by `names`
# (header = 1 would also discard the first data row)
spotify_2020_test = pd.read_csv("Spotify_2020.csv", usecols = [0, 1, 2], names = ['Valence', 'Year', 'Acousticness'], header = 0)

spotify_2020_test.head()

Unnamed: 0,Valence,Year,Acousticness
0,0.756,2020,0.221
1,0.347,2020,0.114
2,0.357,2020,0.0194
3,0.522,2020,0.244
4,0.682,2020,0.468
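The header and names parameters interact in a way that is easy to get wrong. The sketch below uses a small in-memory, semicolon-delimited stand-in for a file (the layout is made up for illustration) to show that header=0 replaces the existing header row, while header=1 also consumes the first data row:

```python
import io
import pandas as pd

# Small in-memory stand-in for a CSV file (made-up layout, semicolon-delimited)
raw = "valence;year;acousticness\n0.756;2020;0.221\n0.347;2020;0.114\n0.357;2020;0.0194\n"
new_names = ['Valence', 'Year', 'Acousticness']

# header=0: line 0 is the header row and is replaced by new_names
replaced = pd.read_csv(io.StringIO(raw), sep=';', header=0, names=new_names)

# header=1: line 1 (the first data row) is treated as the header and discarded as well
shifted = pd.read_csv(io.StringIO(raw), sep=';', header=1, names=new_names)

print(len(replaced), len(shifted))
print(replaced.head())
```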


# General Functions for a Pandas Dataframe

There are a lot of built-in pandas functions to help you get an idea of your dataframe structure

In [15]:
spotify_2020.shape # shows the number of rows and number of columns

(2030, 19)

In [16]:
spotify_2020.size # shows the total number of elements (rows x columns), not the file size

38570
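The value above is simply rows times columns (2030 x 19 = 38570). A small sketch on a toy dataframe shows how shape, size, and len relate. If you want the actual memory footprint instead, memory_usage is one option:

```python
import pandas as pd

# Toy dataframe for illustration only
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

rows, cols = df.shape
print(df.size)   # number of elements: rows * cols
print(len(df))   # number of rows only

memory_bytes = df.memory_usage(deep=True).sum()  # approximate memory footprint in bytes
print(memory_bytes)
```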

In [17]:
spotify_2020.columns # shows you the column names

Index(['valence', 'year', 'acousticness', 'artists', 'danceability',
       'duration_ms', 'energy', 'explicit', 'id', 'instrumentalness', 'key',
       'liveness', 'loudness', 'mode', 'name', 'popularity', 'release_date',
       'speechiness', 'tempo'],
      dtype='object')

In [None]:
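# Alternatively, this is how you can change the column names after the dataframe has been defined
# The list needs exactly one name per column (19 for this dataframe), so the 12 names below are illustrative only
# spotify_2020.columns = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j','k', 'l']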

In [18]:
spotify_2020.dtypes # See all the data types in the dataframe

valence             float64
year                  int64
acousticness        float64
artists              object
danceability        float64
duration_ms           int64
energy              float64
explicit              int64
id                   object
instrumentalness    float64
key                   int64
liveness            float64
loudness            float64
mode                  int64
name                 object
popularity            int64
release_date         object
speechiness         float64
tempo               float64
dtype: object

In [19]:
spotify_2020.info() # gives a high-level view of the dataframe, including data types and null values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2030 entries, 0 to 2029
Data columns (total 19 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   valence           2030 non-null   float64
 1   year              2030 non-null   int64  
 2   acousticness      2030 non-null   float64
 3   artists           2030 non-null   object 
 4   danceability      2030 non-null   float64
 5   duration_ms       2030 non-null   int64  
 6   energy            2030 non-null   float64
 7   explicit          2030 non-null   int64  
 8   id                2030 non-null   object 
 9   instrumentalness  2030 non-null   float64
 10  key               2030 non-null   int64  
 11  liveness          2030 non-null   float64
 12  loudness          2030 non-null   float64
 13  mode              2030 non-null   int64  
 14  name              2030 non-null   object 
 15  popularity        2030 non-null   int64  
 16  release_date      2030 non-null   object 


In [20]:
spotify_2020.describe() # Gives summary statistics for the numeric columns. Note that string (object) columns are excluded from this view by default

Unnamed: 0,valence,year,acousticness,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,mode,popularity,speechiness,tempo
count,2030.0,2030.0,2030.0,2030.0,2030.0,2030.0,2030.0,2030.0,2030.0,2030.0,2030.0,2030.0,2030.0,2030.0,2030.0
mean,0.501048,2020.0,0.219931,0.692904,193728.397537,0.631232,0.495567,0.016376,5.314286,0.178535,-6.595067,0.581281,64.30197,0.141384,124.283129
std,0.222656,0.0,0.240484,0.147439,46758.66155,0.163255,0.500104,0.102608,3.647307,0.137607,3.035709,0.493471,22.107782,0.12673,29.953435
min,0.0,2020.0,2e-06,0.0,30579.0,0.00512,0.0,0.0,0.0,0.0242,-38.414,0.0,0.0,0.0,0.0
25%,0.334,2020.0,0.0358,0.599,167442.0,0.532,0.0,0.0,2.0,0.0988,-7.62925,0.0,65.0,0.047,98.08525
50%,0.509,2020.0,0.131,0.716,190588.0,0.641,0.0,0.0,5.0,0.123,-6.11,1.0,70.0,0.0864,124.7105
75%,0.66875,2020.0,0.321,0.799,215993.0,0.746,1.0,1.8e-05,8.0,0.208,-4.8855,1.0,76.0,0.207,145.1255
max,0.976,2020.0,0.994,0.98,646239.0,1.0,1.0,0.963,11.0,0.953,-0.836,1.0,100.0,0.894,205.895
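If you also want the text columns summarized, describe accepts an include parameter. A minimal sketch on a toy dataframe (the data is made up for illustration):

```python
import pandas as pd

# Toy data for illustration only
songs = pd.DataFrame({'name': ['a', 'b', 'a'],
                      'popularity': [90, 80, 70]})

numeric_summary = songs.describe()           # numeric columns only
all_summary = songs.describe(include='all')  # adds unique/top/freq rows for text columns

print(numeric_summary)
print(all_summary)
```

Statistics that do not apply to a column (for example, the mean of a text column) show up as NaN in the combined view.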


Note that release_date has dtype object rather than datetime. We can fix this either while reading in the CSV or after we've already read it in.

### After we've read it in

In [21]:
import datetime # https://docs.python.org/3/library/datetime.html

In [22]:
spotify_2020['release_date'] = pd.to_datetime(spotify_2020['release_date'], format='%Y-%m-%d')

print(spotify_2020['release_date'].head(), '\n')
print(spotify_2020.info())

0   2020-07-24
1   2020-07-03
2   2020-08-07
3   2020-08-14
4   2020-10-30
Name: release_date, dtype: datetime64[ns] 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2030 entries, 0 to 2029
Data columns (total 19 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   valence           2030 non-null   float64       
 1   year              2030 non-null   int64         
 2   acousticness      2030 non-null   float64       
 3   artists           2030 non-null   object        
 4   danceability      2030 non-null   float64       
 5   duration_ms       2030 non-null   int64         
 6   energy            2030 non-null   float64       
 7   explicit          2030 non-null   int64         
 8   id                2030 non-null   object        
 9   instrumentalness  2030 non-null   float64       
 10  key               2030 non-null   int64         
 11  liveness          2030 non-null   float64       
 12  loudness     
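Real files often contain a few values that do not match the expected format. As a hedged sketch (the sample values are made up), errors='coerce' turns unparseable entries into NaT instead of raising an error:

```python
import pandas as pd

# Made-up sample with one malformed entry
raw_dates = pd.Series(['2020-07-24', '2020-07-03', 'not a date'])

parsed = pd.to_datetime(raw_dates, format='%Y-%m-%d', errors='coerce')

print(parsed)
print(parsed.isna().sum())  # how many values failed to parse
```

Counting the NaT values afterwards is a quick way to find out how many rows need attention.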

### While we're reading it in

In [23]:
spotify_2020 = pd.read_csv("Spotify_2020.csv", parse_dates = ['release_date'])

print(spotify_2020['release_date'].head(), '\n')
print(spotify_2020.info())

0   2020-07-24
1   2020-07-03
2   2020-08-07
3   2020-08-14
4   2020-10-30
Name: release_date, dtype: datetime64[ns] 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2030 entries, 0 to 2029
Data columns (total 19 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   valence           2030 non-null   float64       
 1   year              2030 non-null   int64         
 2   acousticness      2030 non-null   float64       
 3   artists           2030 non-null   object        
 4   danceability      2030 non-null   float64       
 5   duration_ms       2030 non-null   int64         
 6   energy            2030 non-null   float64       
 7   explicit          2030 non-null   int64         
 8   id                2030 non-null   object        
 9   instrumentalness  2030 non-null   float64       
 10  key               2030 non-null   int64         
 11  liveness          2030 non-null   float64       
 12  loudness     

Type in spotify_2020. and then hit Tab. Jupyter will show you all the attributes and methods you can use.

Alternatively, you can go [straight to the documentation](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html) to find all the functions available. Many of these you will never use, but it's helpful to understand what they do. 

In [24]:
from datetime import datetime

# You can also pass a custom parser; here you can define custom datetime formats
def custom_date_parser(x):
    return datetime.strptime(x, "%Y-%m-%d")

# Note: date_parser is deprecated since pandas 2.0, where date_format = "%Y-%m-%d" is the replacement
spotify_2020 = pd.read_csv("Spotify_2020.csv", parse_dates = ['release_date'], date_parser = custom_date_parser)

print(spotify_2020['release_date'].head(), '\n')
print(spotify_2020.info())

0   2020-07-24
1   2020-07-03
2   2020-08-07
3   2020-08-14
4   2020-10-30
Name: release_date, dtype: datetime64[ns] 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2030 entries, 0 to 2029
Data columns (total 19 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   valence           2030 non-null   float64       
 1   year              2030 non-null   int64         
 2   acousticness      2030 non-null   float64       
 3   artists           2030 non-null   object        
 4   danceability      2030 non-null   float64       
 5   duration_ms       2030 non-null   int64         
 6   energy            2030 non-null   float64       
 7   explicit          2030 non-null   int64         
 8   id                2030 non-null   object        
 9   instrumentalness  2030 non-null   float64       
 10  key               2030 non-null   int64         
 11  liveness          2030 non-null   float64       
 12  loudness     

# Reading in Multiple Files

## Least Efficient Way

In [25]:
csv_file_list = os.listdir() # os.listdir() returns everything in the current directory, not just CSV files

csv_file_list

['.ipynb_checkpoints',
 'Intro To Pandas.ipynb',
 'Spotify_2016.csv',
 'Spotify_2017.csv',
 'Spotify_2018.csv',
 'Spotify_2019.csv',
 'Spotify_2020.csv']

As you can see, this directory does not contain only CSVs, so we will need to clean the list up.

In [26]:
csv_file_list = csv_file_list[2:] # Keep everything from index 2 onward, dropping the first two (non-CSV) entries

csv_file_list

['Spotify_2016.csv',
 'Spotify_2017.csv',
 'Spotify_2018.csv',
 'Spotify_2019.csv',
 'Spotify_2020.csv']
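Slicing by position only works if you know exactly where the unwanted entries sit. A sketch of a less fragile alternative is to filter by file extension. To keep the example self-contained it builds a temporary folder with made-up file names:

```python
import os
import tempfile

# Build a throwaway folder with a mix of files (names are made up for illustration)
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ['notes.ipynb', 'Spotify_2019.csv', 'Spotify_2020.csv']:
        open(os.path.join(tmp_dir, name), 'w').close()

    # Keep only the entries whose name ends in .csv, sorted for a stable order
    csv_files = sorted(f for f in os.listdir(tmp_dir) if f.endswith('.csv'))

print(csv_files)
```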

In [27]:
# Create an empty list to append the dataframes to
list_of_dataframes = []

# Looping - For each filename in the csv_file_list
for filename in csv_file_list:
    
    # Apply the read_csv function to the filename, then append the resulting dataframe to the list
    list_of_dataframes.append(pd.read_csv(filename, parse_dates = ['release_date']))

# Concatenate all the CSVs together with the Concat function    
spotify_dataset = pd.concat(list_of_dataframes)

# View the head
spotify_dataset.head()

Unnamed: 0,valence,year,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo
0,0.225,2016,0.701,['James Arthur'],0.311,208827,0.485,0,55Am8neGJkdj2ADaM3aw5H,0.0,6,0.0726,-5.726,0,Train Wreck,88,2016-10-28,0.0365,77.355
1,0.291,2016,0.0689,['Post Malone'],0.556,223347,0.538,1,75ZvA4QfFiZvzhj2xkaWAh,0.0,8,0.196,-5.408,0,I Fall Apart,83,2016-12-09,0.0382,143.95
2,0.488,2016,0.245,['Ricky Montgomery'],0.639,216880,0.526,0,0MF5QHFzTUM2dYm6J7Vngt,2.4e-05,0,0.25,-6.697,1,Mr Loverman,82,2016-04-08,0.0256,130.033
3,0.494,2016,0.695,['James Arthur'],0.358,211467,0.557,0,5uCax9HTNlzGybIStD3vDh,0.0,10,0.0902,-7.398,1,Say You Won't Let Go,86,2016-10-28,0.059,85.043
4,0.572,2016,0.167,['Childish Gambino'],0.743,326933,0.347,1,0wXuerDYiBnERgIpbb3JBR,0.00951,1,0.103,-11.174,1,Redbone,82,2016-12-02,0.121,160.143
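One thing to be aware of: pd.concat keeps each file's original row index, so the combined dataframe contains repeated index values (every file starts again at 0). A minimal sketch with toy dataframes shows the difference that ignore_index=True makes:

```python
import pandas as pd

# Toy stand-ins for two yearly files
df_2019 = pd.DataFrame({'year': [2019, 2019]})
df_2020 = pd.DataFrame({'year': [2020, 2020]})

stacked = pd.concat([df_2019, df_2020])                        # keeps each frame's own index
renumbered = pd.concat([df_2019, df_2020], ignore_index=True)  # builds a fresh 0..n-1 index

print(stacked.index.tolist())     # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```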


## More Efficient Way

The glob module finds all the pathnames matching a specified pattern according to the rules used by the Unix shell, although results are returned in arbitrary order.

In [28]:
import glob

In [29]:
glob.glob('*.csv') # Returns a list of all files ending in CSV

['Spotify_2016.csv',
 'Spotify_2017.csv',
 'Spotify_2018.csv',
 'Spotify_2019.csv',
 'Spotify_2020.csv']

Note the star at the beginning. It is a wildcard that matches any sequence of characters, so glob doesn't care how the file name starts; it only requires that the name end in .csv.

In [30]:
glob.glob('Spotify*') # Matches all file names that begin with Spotify

['Spotify_2016.csv',
 'Spotify_2017.csv',
 'Spotify_2018.csv',
 'Spotify_2019.csv',
 'Spotify_2020.csv']

In [31]:
glob.glob('Spotify_202*.csv') # Matches all file names that begin with "Spotify_202" and end with ".csv"

['Spotify_2020.csv']
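Because glob returns matches in arbitrary order, it can help to sort them when the order matters (for example, to stack the years chronologically). A self-contained sketch using a temporary folder with made-up file names:

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create empty placeholder files (names are made up for illustration)
    for name in ['Spotify_2020.csv', 'notes.txt', 'Spotify_2019.csv']:
        open(os.path.join(tmp_dir, name), 'w').close()

    # The pattern can include a folder, so there is no need to change directory first
    pattern = os.path.join(tmp_dir, 'Spotify_20*.csv')
    matches = sorted(glob.glob(pattern))  # sorted() gives a predictable order
    names = [os.path.basename(path) for path in matches]

print(names)
```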

## Most Efficient Way

### For Loop

In [32]:
# Create an empty list to append the dataframes to
list_of_dataframes = []

# Looping - For each filename matched by the glob pattern
for filename in glob.glob('Spotify*.csv'):
    
    # Apply the read_csv function to the filename, then append the resulting dataframe to the list
    list_of_dataframes.append(pd.read_csv(filename, parse_dates = ['release_date']))

# Concatenate all the CSVs together with the Concat function    
spotify_2016_2020 = pd.concat(list_of_dataframes)

# View the unique years in the dataframe
spotify_2016_2020['year'].unique()

array([2016, 2017, 2018, 2019, 2020], dtype=int64)

### List Comprehension

In [33]:
spotify_2016_2020 = pd.concat([pd.read_csv(filename, parse_dates = ['release_date']) for filename in glob.glob('Spotify*.csv')])

spotify_2016_2020['year'].unique()

array([2016, 2017, 2018, 2019, 2020], dtype=int64)

### And Finally, Write Dataframe to CSV

In [35]:
spotify_2016_2020.to_csv('Spotify_2016_2020.csv', index=False) # note that index=False must be added; otherwise the index is written as an extra column in the CSV
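To see what the index parameter changes, here is a small sketch on a toy dataframe. Calling to_csv without a file name returns the CSV text as a string, which makes the header lines easy to compare:

```python
import pandas as pd

# Toy data for illustration only
df = pd.DataFrame({'year': [2019, 2020], 'popularity': [88, 99]})

with_index = df.to_csv()                # no path given, so the CSV text is returned
without_index = df.to_csv(index=False)

print(with_index.splitlines()[0])     # ,year,popularity  (leading empty header for the index)
print(without_index.splitlines()[0])  # year,popularity
```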

# Miscellaneous Pandas Functions

## Filtering

In [36]:
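# Create a new dataframe filtered to 2020 data only (.copy() makes it independent of spotify_2016_2020, so columns can be added later without a SettingWithCopyWarning)
spotify_2020 = spotify_2016_2020.loc[spotify_2016_2020['year'] == 2020].copy()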
spotify_2020['year'].unique()

array([2020], dtype=int64)

In [37]:
spotify_2020_popularity = spotify_2020.loc[(spotify_2020['popularity'] >= 90) & (spotify_2020['explicit'] == 0)]

print(spotify_2020_popularity.popularity.unique())
print(spotify_2020_popularity.explicit.unique())

[95 96 97 93 92 91 90 94]
[0]
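The filter above combines two conditions with &. Each condition needs its own parentheses because & binds more tightly than comparisons such as >=. A short sketch on made-up data shows the and (&), or (|), and not (~) forms side by side:

```python
import pandas as pd

# Made-up data for illustration only
songs = pd.DataFrame({'name': ['A', 'B', 'C', 'D'],
                      'popularity': [95, 92, 60, 91],
                      'explicit': [0, 1, 0, 0]})

popular_and_clean = songs.loc[(songs['popularity'] >= 90) & (songs['explicit'] == 0)]  # both must hold
popular_or_clean = songs.loc[(songs['popularity'] >= 90) | (songs['explicit'] == 0)]   # either may hold
not_explicit = songs.loc[~(songs['explicit'] == 1)]                                    # negation

print(popular_and_clean['name'].tolist())  # ['A', 'D']
print(popular_or_clean['name'].tolist())   # ['A', 'B', 'C', 'D']
print(not_explicit['name'].tolist())       # ['A', 'C', 'D']
```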


In [38]:
# Selecting a few columns with a condition (note that this filters the combined 2016-2020 dataframe, not just 2020)
spotify_popular = spotify_2016_2020.loc[(spotify_2016_2020['popularity'] >= 90), ['artists', 'name']]

spotify_popular.head()

Unnamed: 0,artists,name
0,['Clairo'],Sofia
1,['Harry Styles'],Watermelon Sugar
7,['Lewis Capaldi'],Someone You Loved
20,"['Topic', 'A7S']",Breaking Me
0,"['24kGoldn', 'iann dior']",Mood (feat. iann dior)


## Group By

In [40]:
# Group by popularity, count number of song IDs per popularity ranking
spotify_2020.groupby('popularity')['id'].count()

popularity
0      61
1      40
2      26
3      14
4      11
       ..
95      4
96      4
97      1
99      1
100     1
Name: id, Length: 92, dtype: int64

In [42]:
# Grouping by Artists, Counting the number of times they appear
artists_2020 = spotify_2020.groupby('artists').artists.count()
artists_2020.sort_values(ascending = False)

artists
['Future', 'Lil Uzi Vert']                75
['YoungBoy Never Broke Again']            32
['J Balvin']                              26
['NAV']                                   25
['BTS']                                   24
                                          ..
['King Von', 'Polo G']                     1
['King Von', 'Moneybagg Yo']               1
['Sturgill Simpson']                       1
['King Von', 'Lil Durk', 'Prince Dre']     1
['Ólafur Arnalds']                         1
Name: artists, Length: 1065, dtype: int64
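Counting how often each value appears is common enough that pandas has a shortcut for it. As a sketch on made-up data, value_counts gives the same counts as the groupby above and already sorts them in descending order:

```python
import pandas as pd

# Made-up data for illustration only
songs = pd.DataFrame({'artists': ['X', 'X', 'Y'],
                      'name': ['s1', 's2', 's3']})

counts_groupby = songs.groupby('artists')['artists'].count().sort_values(ascending=False)
counts_shortcut = songs['artists'].value_counts()  # sorted from most to least frequent

print(counts_groupby)
print(counts_shortcut)
```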

In [48]:
# Group by artists and song name, then use the agg function to get the min and max popularity for each song
artists_2020 = spotify_2020.groupby(['artists', 'name']).agg({'popularity': ['min', 'max']})
artists_2020.T

Unnamed: 0_level_0,artists,"[""Olivia O'Brien""]","[""Olivia O'Brien""]","[""Why Don't We""]","['$NOT', 'Flo Milli']","['$NOT', 'Maggie Lindemann']","['$NOT', 'Night Lovell']","['$NOT', 'Wifisfuneral']","['$NOT', 'iann dior']",['$NOT'],['$NOT'],...,['mike.'],['niko-qée'],['niko-qée'],['niko-qée'],['ppcocaine'],['ppcocaine'],['ppcocaine'],['salem ilese'],['twocolors'],['Ólafur Arnalds']
Unnamed: 0_level_1,name,Josslyn,NOW,Fallin’ (Adrenaline),Mean,Moon & Stars (feat. Maggie Lindemann),Human (feat. Night Lovell),BERETTA (feat. Wifisfuneral),Like Me (feat. iann dior),GOSHA,Revenge,...,2 birds,Finest,Like a Rose,No sleep at all,DDLG,Hugh Hefner,PJ,Mad at Disney,Lovefool,We Contain Multitudes (from home)
popularity,min,77,68,81,65,76,67,69,70,78,70,...,66,0,0,0,65,67,68,88,84,70
popularity,max,77,68,81,71,76,67,69,70,78,70,...,66,0,0,0,65,67,68,88,84,70
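Passing a dictionary to agg produces two-level column headers, as seen above. An alternative sketch, on made-up data, uses named aggregation to get flat, descriptive column names and to include a count as well (the output column names here are hypothetical choices):

```python
import pandas as pd

# Made-up data for illustration only
songs = pd.DataFrame({'artists': ['X', 'X', 'Y'],
                      'popularity': [70, 90, 50]})

# Each keyword is an output column name: (input column, aggregation function)
summary = songs.groupby('artists').agg(
    song_count=('popularity', 'count'),
    min_popularity=('popularity', 'min'),
    max_popularity=('popularity', 'max'),
)

print(summary)
```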


## Creating New Columns

In [44]:
# Create a new column for minutes
spotify_2020['minutes'] = spotify_2020['duration_ms'] / 60000

# Create a new column for day of week for release date
spotify_2020['weekday'] = spotify_2020['release_date'].dt.day_name()

spotify_2020.head()

Unnamed: 0,valence,year,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,...,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,minutes,weekday
0,0.756,2020,0.221,"['24kGoldn', 'iann dior']",0.7,140526,0.722,1,3tjFYV6RSFtuktYl3ZtYcq,0.0,...,0.272,-3.558,0,Mood (feat. iann dior),99,2020-07-24,0.0369,90.989,2.3421,Friday
1,0.347,2020,0.114,"['Pop Smoke', 'Lil Baby', 'DaBaby']",0.823,190476,0.586,1,0PvFJmanyNQMseIFrU708S,0.0,...,0.193,-6.606,0,For The Night (feat. Lil Baby & DaBaby),95,2020-07-03,0.2,125.971,3.1746,Friday
2,0.357,2020,0.0194,"['Cardi B', 'Megan Thee Stallion']",0.935,187541,0.454,1,4Oun2ylbjFKMPTiaSbbCih,0.0,...,0.0824,-7.509,1,WAP (feat. Megan Thee Stallion),96,2020-08-07,0.375,133.073,3.125683,Friday
3,0.522,2020,0.244,"['Drake', 'Lil Durk']",0.761,261493,0.518,1,2SAqBLGA283SUiwJ3xOUVI,3.5e-05,...,0.107,-8.871,1,Laugh Now Cry Later (feat. Lil Durk),93,2020-08-14,0.134,133.976,4.358217,Friday
4,0.682,2020,0.468,['Ariana Grande'],0.737,172325,0.802,1,35mvY5S1H3J2QZyna3TFe0,0.0,...,0.0931,-4.771,1,positions,96,2020-10-30,0.0878,144.015,2.872083,Friday
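When the dataframe you are adding columns to was itself produced by filtering another dataframe, pandas may emit a SettingWithCopyWarning. A hedged sketch of one common way to avoid it is to take an explicit copy of the filtered rows first (the data below is made up, apart from two release dates taken from the table above):

```python
import pandas as pd

# Mostly made-up data for illustration
songs = pd.DataFrame({
    'year': [2019, 2020, 2020],
    'duration_ms': [180000, 140526, 240000],
    'release_date': pd.to_datetime(['2019-05-03', '2020-07-24', '2020-10-30']),
})

# .copy() makes the filtered result an independent dataframe
songs_2020 = songs.loc[songs['year'] == 2020].copy()

songs_2020['minutes'] = songs_2020['duration_ms'] / 60000          # milliseconds to minutes
songs_2020['weekday'] = songs_2020['release_date'].dt.day_name()   # e.g. 'Friday'

print(songs_2020)
```

The original dataframe is left untouched; only the copy gains the new columns.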
