# Pandas

This is a gentle introduction to Python DataFrames using pandas. It covers basic usage and conversions, because data scientists and analysts often use DataFrames without fully understanding how to convert them into other types.

## Types used by Pandas

### DataFrame

A DataFrame is a 2-dimensional, tabular data structure: data is stored in named columns, every row has an index label, and different columns can hold different data types.
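As a minimal sketch of that structure, we can build a small DataFrame from a dictionary. The column names and values here are made up for illustration and are not part of the movie dataset used later.

```python
import pandas as pd

# Hypothetical sample data: each dictionary key becomes a column name
people_df = pd.DataFrame(
    {
        "name": ["Ada", "Grace", "Linus"],
        "age": [36, 45, 28],
    }
)

# shape is (number of rows, number of columns)
print(people_df.shape)

# Each column keeps its own data type
print(people_df.dtypes)
```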

### Series

A Series is a 1-dimensional, labeled data structure that stores values of a single data type in an array-like (list-like) structure. It is closer to a strongly typed array than to a plain Python list.
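A short sketch, again with made-up values, shows a Series created directly and a Series obtained by selecting one column of a DataFrame.

```python
import pandas as pd

# Hypothetical sample values; pandas infers one dtype for the whole Series
ages = pd.Series([36, 45, 28], name="age")
print(ages.dtype)

# Selecting a single column of a DataFrame returns a Series
people_df = pd.DataFrame({"name": ["Ada", "Grace", "Linus"], "age": [36, 45, 28]})
age_column = people_df["age"]
print(type(age_column))
```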

For more information about these types, see the [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#dataframe).

## How to install Pandas?

Using the `pip` package manager for Python, run the following command. Activate a virtual environment first so that the package is not installed globally and does not affect the system Python.

```bash
pip install pandas
```

For more details, see the official [installation guide](https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html).
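Once the install finishes, one quick way to check that it worked is to import both packages and print their versions. This is only a sanity check, and the versions printed depend on your environment.

```python
import numpy as np
import pandas as pd

# Print the installed versions to confirm that both packages can be imported
print("pandas:", pd.__version__)
print("numpy:", np.__version__)
```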

## Getting started with Pandas

First, we import the module under the alias `pd` to make it shorter to write. Aliasing with `import ... as` is a Python feature and is not specific to pandas; the same is done for NumPy with `np`. NumPy is a dependency of pandas, so it is installed along with it.

In [1]:
# Import needed modules

# NumPy is optional here; skip this import if you do not use it directly
import numpy as np
import pandas as pd

### Loading Data

Pandas loads data into a DataFrame, which is a 2D structure. Its performance-critical parts are written in Cython and C, which makes it much faster than equivalent pure-Python loops. A DataFrame can be built from different sources, such as Python dictionaries or files (CSV, JSON, etc.). Most of the file loaders are named `read_*`, where `*` is the format (for example `read_csv` or `read_json`). A complete list of supported formats is available in the pandas documentation [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html).

Here, the CSV is loaded from a link, but you can load it from disk as well.

In [2]:
movie_df = pd.read_csv("https://raw.githubusercontent.com/MainakRepositor/Datasets/refs/heads/master/movie_rec/movies.csv")
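The other sources mentioned above can be sketched in a few lines. The CSV text below is a made-up stand-in for a file on disk, and `movies.csv` in the comment is a hypothetical local path.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = "movieId,title\n1,Toy Story (1995)\n2,Jumanji (1995)\n"

# read_csv accepts a path, a URL, or a file-like object such as StringIO
from_csv = pd.read_csv(io.StringIO(csv_text))

# A dictionary mapping column names to lists of values also works
from_dict = pd.DataFrame(
    {"movieId": [1, 2], "title": ["Toy Story (1995)", "Jumanji (1995)"]}
)

# For a local file, pass its path instead, e.g. pd.read_csv("movies.csv")
print(from_csv)
print(from_dict)
```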

### Getting Some Information About the Loaded Data

In [3]:
# Show the Column Names
# Notice that the columns attribute is an Index, not a Series
print("DataFrame Columns:")
print(movie_df.columns)

# First 5 Records (Rows)
print("First 5 Records:")
print(movie_df.head())

# First n Records (Rows)
n = 10
print(f"First n Records(n is {n}):")
print(movie_df.head(n))

# Last 5 Records (Rows)
print("Last 5 Records:")
print(movie_df.tail())

# Last n Records (Rows)
n = 10
print(f"Last n Records(n is {n}):")
print(movie_df.tail(n))

# Get more info about the dataframe: column data types, non-null counts, memory usage, etc.
# info() prints its report directly and returns None, so it is not wrapped in print()
print("DataFrame information")
movie_df.info()
print("DataFrame types")
print(movie_df.dtypes)
print("DataFrame counts")
print(movie_df.count())

# Get the index of the dataframe
print("DataFrame Index")
print(movie_df.index)

# Get the numpy representation
print("DataFrame raw numpy representation")
print(movie_df.to_numpy())

DataFrame Columns:
Index(['movieId', 'title', 'genres'], dtype='object')
First 5 Records:
   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  
First n Records(n is 10):
   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II
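Since converting a DataFrame into other types is a common stumbling block, here is a minimal sketch of a few conversions. The small frame is made up to mirror the movie columns, and the variable names are only illustrative.

```python
import pandas as pd

# Hypothetical sample frame mirroring two of the movie columns
sample_df = pd.DataFrame(
    {
        "movieId": [1, 2, 3],
        "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    }
)

# columns is an Index; convert it when a plain Python list is needed
column_names = sample_df.columns.tolist()

# Convert the whole frame to a NumPy array or to a list of row dictionaries
as_array = sample_df.to_numpy()
as_records = sample_df.to_dict(orient="records")

print(column_names)
print(as_array.shape)
print(as_records[0])
```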

In [4]:
# Replace the placeholder string with a missing value in one vectorized step
movie_df.loc[movie_df['genres'] == '(no genres listed)', 'genres'] = None
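As a minimal sketch of the same clean-up on a small, made-up frame, a boolean mask can do the replacement for all matching rows at once, and `isna()` then confirms which rows are missing.

```python
import pandas as pd

# Hypothetical sample frame that uses the same placeholder string
sample_df = pd.DataFrame(
    {
        "title": ["Toy Story (1995)", "Unknown Movie"],
        "genres": ["Adventure|Animation", "(no genres listed)"],
    }
)

# Build a boolean mask, then assign to every matching row at once
no_genres = sample_df["genres"] == "(no genres listed)"
sample_df.loc[no_genres, "genres"] = None

# isna() reports which rows now hold a missing value
print(sample_df["genres"].isna())
```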

In [5]:
# Split the pipe-separated genres string into a list of genres
movie_df['genres'] = movie_df['genres'].str.split('|')
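To see what the split produces, here is a small sketch on made-up genre strings, including one missing value. Missing entries stay missing instead of becoming empty lists.

```python
import pandas as pd

# Hypothetical genre strings, including one missing value
genres = pd.Series(["Adventure|Children|Fantasy", "Comedy", None])

# A single-character separator is treated as a literal string, not a regex
split_genres = genres.str.split("|")

print(split_genres[0])
print(split_genres.isna())
```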

In [7]:
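# Flag rows whose (movieId, title) pair already appeared in an earlier row
duplicates = movie_df.duplicated(subset=['movieId', 'title'])
# Keep only the index labels of the flagged rows
duplicates = duplicates[duplicates].index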
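The duplicate check can be tried on a small, made-up frame. This sketch also uses `drop_duplicates`, which is one possible next step and is an assumption about what to do with the flagged rows, since the cell above only collects their index.

```python
import pandas as pd

# Hypothetical sample frame where the last row repeats the first
sample_df = pd.DataFrame(
    {
        "movieId": [1, 2, 1],
        "title": ["Toy Story (1995)", "Jumanji (1995)", "Toy Story (1995)"],
    }
)

# True marks a row whose (movieId, title) pair appeared in an earlier row
is_duplicate = sample_df.duplicated(subset=["movieId", "title"])
duplicate_index = is_duplicate[is_duplicate].index

# Keep only the first occurrence of each pair
deduplicated = sample_df.drop_duplicates(subset=["movieId", "title"])

print(duplicate_index.tolist())
print(len(deduplicated))
```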

In [8]:
movie_df

| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | [Adventure, Animation, Children, Comedy, Fantasy] |
| 1 | 2 | Jumanji (1995) | [Adventure, Children, Fantasy] |
| 2 | 3 | Grumpier Old Men (1995) | [Comedy, Romance] |
| 3 | 4 | Waiting to Exhale (1995) | [Comedy, Drama, Romance] |
| 4 | 5 | Father of the Bride Part II (1995) | [Comedy] |
| ... | ... | ... | ... |
| 10324 | 146684 | Cosmic Scrat-tastrophe (2015) | [Animation, Children, Comedy] |
| 10325 | 146878 | Le Grand Restaurant (1966) | [Comedy] |
| 10326 | 148238 | A Very Murray Christmas (2015) | [Comedy] |
| 10327 | 148626 | The Big Short (2015) | [Drama] |
