# Pandas

Pandas is a powerful Python library used for data manipulation and analysis. It provides two primary data structures:

- **Series**: A one-dimensional array-like structure.
- **DataFrame**: A two-dimensional tabular structure with rows and columns.

Pandas is widely used in data science and machine learning. Together with its tools for cleaning, filtering, reshaping, and merging data, it lets you inspect data and perform calculations on large datasets.
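
To make the two structures concrete, here is a minimal sketch that builds a Series and a DataFrame by hand. The column names and values are made up for illustration only.

```python
import pandas as pd

# A Series: one-dimensional, labeled values (made-up data).
ages = pd.Series([20, 23, 25], name="age")

# A DataFrame: a table whose columns are Series that share one index.
people = pd.DataFrame({"age": [20, 23, 25], "gender": [1, 1, 0]})

print(ages.shape)            # a single dimension
print(people.shape)          # (rows, columns)
print(type(people["age"]))   # selecting one column gives back a Series
```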

## Setup

In [None]:
!pip install pandas

In [None]:
import pandas as pd #Import the pandas library with pd alias

In [None]:
gameData = pd.read_csv("vgsales.csv")  # load the video game sales data as a DataFrame
gameData.shape  # show the shape (rows, columns) of the DataFrame

The read_csv method reads the .csv file (similar to how you would read a JSON file) and stores it in a DataFrame object.

Note: A CSV file is basically a plain-text version of an Excel sheet, with one row per line and commas between the columns.
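
The sketch below shows that a CSV is just text by parsing one from a string instead of a file. The column names and rows are invented for this example and are not the real contents of vgsales.csv.

```python
import io

import pandas as pd

# A CSV is plain text: one header row, then comma-separated data rows.
# Invented sample content, not the real vgsales.csv.
csv_text = """Name,Year,Global_Sales
Game A,2006,82.74
Game B,,40.24
Game C,2008,35.82
"""

# read_csv accepts any file-like object, so StringIO stands in for a file here.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)
print(sample["Year"].isna().sum())  # the empty Year field is read as a missing value
```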

In [None]:
gameData.columns # get the column names

In [None]:
gameData.head(3) # show the first 3 rows of the data

In [None]:
gameData['Name'].head(3)  # output the first 3 rows of the Name column

In [None]:
gameData.describe() #output the summary statistics of the data
#Take note of the missing data in rank and year
#we will use that later

In [None]:
gameData.info()  # print information about the data (info() prints its report itself)
#Take note of the missing data in rank and year
#we will use that later
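
One way to spot missing data directly is to count it per column. This is a small sketch on an invented table; the column names only imitate the ones mentioned above.

```python
import pandas as pd

# Invented table with two missing Year values.
toy = pd.DataFrame({
    "Rank": [1, 2, 3, 4],
    "Year": [2006.0, float("nan"), 2008.0, float("nan")],
})

# isna() marks missing cells; sum() counts them per column.
missing = toy.isna().sum()
print(missing)

# describe() reports "count" as the number of non-missing values,
# so a count below the number of rows also reveals missing data.
non_missing = toy.describe().loc["count"]
print(non_missing)
```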

## Analyzing the Data
Let's switch to a smaller dataset.

In [None]:
music_df = pd.read_csv("music.csv")  # load the music data as a DataFrame (df is short for DataFrame)
print(music_df)  # print the contents of the DataFrame


### Find the Distribution of Genres

In [None]:
genre_counts = music_df['genre'].value_counts()
print(genre_counts)
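
As a small sketch of what value_counts returns, the example below uses an invented genre column. Passing normalize=True gives shares instead of raw counts.

```python
import pandas as pd

# Invented data for illustration.
toy = pd.DataFrame({
    "genre": ["HipHop", "Jazz", "HipHop", "Classical", "Jazz", "HipHop"],
})

counts = toy["genre"].value_counts()                 # rows per genre, largest first
shares = toy["genre"].value_counts(normalize=True)   # same, as fractions of the total

print(counts)
print(shares)
```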


### Filter the data

In [None]:
males = music_df[music_df['gender'] == 1]
females = music_df[music_df['gender'] == 0]
print(males.head())
print(females.head())
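
Filtering works by building a boolean Series and using it as a mask. The sketch below, on invented data, also combines two conditions; each condition needs its own parentheses when joined with & (and) or | (or).

```python
import pandas as pd

# Invented data for illustration.
toy = pd.DataFrame({
    "age": [20, 26, 31, 22],
    "gender": [1, 1, 0, 0],
    "genre": ["HipHop", "Jazz", "Classical", "Dance"],
})

# The comparison produces one True/False value per row.
is_male = toy["gender"] == 1
print(is_male.tolist())

# Indexing with the mask keeps only the rows where it is True.
young_males = toy[(toy["gender"] == 1) & (toy["age"] < 25)]
print(young_males)
```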


### Group Data

In [None]:
avg_age_by_genre = music_df.groupby('genre')['age'].mean() 
print(avg_age_by_genre)

In [None]:
music_df.groupby('age')['gender'].mean()

In [None]:
music_df.groupby('genre')['gender'].mean()

In [None]:
# This will not work. Why?
music_df.groupby('gender')['genre'].mean()
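
A hint for the question above: genre holds text, and text has no average. The sketch below, on invented data, shows the failure and one possible alternative, the most common genre per group. The exact exception type differs between pandas versions, so it is caught broadly here.

```python
import pandas as pd

# Invented data for illustration.
toy = pd.DataFrame({
    "gender": [1, 1, 1, 0, 0, 0],
    "genre": ["HipHop", "HipHop", "Jazz", "Dance", "Dance", "Classical"],
})

# mean() needs numbers; a column of strings cannot be averaged.
try:
    toy.groupby("gender")["genre"].mean()
    mean_failed = False
except Exception as exc:
    mean_failed = True
    print("mean() failed with:", type(exc).__name__)

# For text columns, the most frequent value (the mode) is a sensible summary.
most_common = toy.groupby("gender")["genre"].agg(lambda s: s.mode().iloc[0])
print(most_common)
```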


### Sort Data

In [None]:
sorted_music = music_df.sort_values(by='age', ascending=True)
print(sorted_music.head())
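
sort_values can also sort in descending order or by several columns at once. A minimal sketch on invented data:

```python
import pandas as pd

# Invented data for illustration.
toy = pd.DataFrame({"age": [25, 20, 25, 31], "gender": [0, 1, 1, 0]})

# Oldest first.
by_age_desc = toy.sort_values(by="age", ascending=False)

# Age ascending; ties on age are broken by gender, descending.
by_two = toy.sort_values(by=["age", "gender"], ascending=[True, False])

print(by_age_desc)
print(by_two)  # the original row labels are kept, so the index shows the reordering
```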

## Visualization with pandas and matplotlib 

In [None]:
import matplotlib.pyplot as plt

In [None]:
genre_counts.plot(kind='bar', title='Genre Distribution')
plt.xlabel('Genre')
plt.ylabel('Count')
plt.show()

In [None]:
avg_age_by_genre.plot(kind='bar', color='skyblue', title='Average Age by Genre')
plt.xlabel('Genre')
plt.ylabel('Average Age')
plt.show()

In [None]:
plt.scatter(males['age'], males['genre'], label='Males', color='blue', alpha=0.5)
plt.scatter(females['age'], females['genre'], label='Females', color='pink', alpha=0.5)
plt.title('Age vs Genre by Gender')
plt.xlabel('Age')
plt.ylabel('Genre')
plt.legend()
plt.show()


## Predictions

In [None]:
df = pd.read_csv("music.csv")  # load the music data as a DataFrame
print(df)  # print the contents of the DataFrame


In [None]:
X = df.drop(columns=['genre'])  # the inputs are every column except the one we want to predict
# drop() does not remove the column from df itself; it returns a new DataFrame without it
# by convention a capital X represents the input set
X

In [None]:
#create our output set
#by convention we use y for the output set
y = df['genre']
y
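
The sketch below, on invented data, checks that dropping a column leaves the original DataFrame untouched and shows how the input and output sets line up row by row.

```python
import pandas as pd

# Invented data for illustration.
toy = pd.DataFrame({
    "age": [20, 26, 31],
    "gender": [1, 1, 0],
    "genre": ["HipHop", "Jazz", "Dance"],
})

X = toy.drop(columns=["genre"])  # input set: a new DataFrame without the target column
y = toy["genre"]                 # output set: the column we want to predict

print(list(X.columns))    # the target column is gone from X
print(list(toy.columns))  # but it is still present in the original
```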

# Decision trees

In [None]:
!pip install scikit-learn 
# scikit-learn (sklearn) is one of the most popular ML libraries in Python


In [None]:
from sklearn.tree import DecisionTreeClassifier


In [None]:
df = pd.read_csv("music.csv")  # load the music data as a DataFrame
X = df.drop(columns=['genre'])
y = df['genre']
model = DecisionTreeClassifier()
X

In [None]:
model.fit(X,y)

In [None]:
predictions = model.predict([[21, 1], [22, 0]])  # each row is one sample, in the same column order as X
predictions
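
When a model is fitted on a DataFrame, recent scikit-learn versions remember the column names and may warn if predict later receives a plain list. One way around this, sketched here on invented data that only imitates the shape of music.csv, is to pass the new samples as a DataFrame with the same columns.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Invented training data (not the real music.csv).
toy = pd.DataFrame({
    "age": [20, 23, 25, 26, 29, 30, 20, 21, 25, 26, 27, 30],
    "gender": [1] * 6 + [0] * 6,
    "genre": ["HipHop"] * 3 + ["Jazz"] * 3 + ["Dance"] * 3 + ["Acoustic"] * 3,
})

X = toy.drop(columns=["genre"])
y = toy["genre"]

model = DecisionTreeClassifier(random_state=0)  # fixed seed so reruns build the same tree
model.fit(X, y)

# New samples use the same column names and order as X.
new_people = pd.DataFrame({"age": [21, 22], "gender": [1, 0]})
predictions = model.predict(new_people)
print(predictions)
```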

Remember, you need 3 sets of data. Let's split ours now; train_test_split below produces the training and test sets.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
model.fit(X_train,y_train)
prediction = model.predict(X_test)
score = accuracy_score(y_test,prediction)
score

In [None]:
df

In [None]:
X_test

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8)  # What do you think the result will be?
model.fit(X_train,y_train)
prediction = model.predict(X_test)
score = accuracy_score(y_test,prediction)
score
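
To compare the two splits side by side, the sketch below runs both on invented data and prints how many rows end up in each set. The random_state values are an assumption added here so that reruns give the same split; without them the score changes from run to run.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Invented data with 20 rows (not the real music.csv).
ages = [20, 21, 22, 23, 24, 26, 27, 28, 29, 30]
toy = pd.DataFrame({
    "age": ages + ages,
    "gender": [1] * 10 + [0] * 10,
    "genre": ["HipHop"] * 5 + ["Jazz"] * 5 + ["Dance"] * 5 + ["Acoustic"] * 5,
})
X = toy.drop(columns=["genre"])
y = toy["genre"]

results = {}
for test_size in (0.2, 0.8):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42
    )
    model = DecisionTreeClassifier(random_state=42)
    model.fit(X_train, y_train)
    score = accuracy_score(y_test, model.predict(X_test))
    # Store (training rows, test rows, accuracy) for each split.
    results[test_size] = (len(X_train), len(X_test), score)
    print(test_size, results[test_size])
```

With test_size=0.8 the model sees only a fifth of the rows during training, so it has much less to learn from.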

# Model persistence

In [None]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
import joblib  # joblib can save a trained model to disk and load it again

df = pd.read_csv("music.csv")  # load the music data as a DataFrame

X = df.drop(columns=['genre'])
y = df['genre']
model = DecisionTreeClassifier()
model.fit(X,y)
joblib.dump(model, 'musik-recomender.joblib')




In [None]:
model = joblib.load('musik-recomender.joblib')
predictions = model.predict([[21, 1]])
predictions
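
As a quick check that saving and loading preserves the model, the sketch below does a round trip through a temporary folder on invented data and compares the predictions. The file name toy-model.joblib is hypothetical.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Invented training data (not the real music.csv).
toy = pd.DataFrame({
    "age": [20, 23, 25, 26, 29, 30],
    "gender": [1, 1, 1, 0, 0, 0],
    "genre": ["HipHop", "HipHop", "Jazz", "Dance", "Acoustic", "Acoustic"],
})
X = toy.drop(columns=["genre"])
y = toy["genre"]

model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

# Save to a temporary folder, then load the model back from the file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy-model.joblib")  # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should make exactly the same predictions.
same = bool((restored.predict(X) == model.predict(X)).all())
print(same)
```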