### 1.0 Downloading the dataset
- `utils.py` provides a `download` function that fetches the dataset from the Kaggle website.
- The function takes two arguments: `copy_from_url` and `path`, the directory where the dataset is saved.
- `copy_from_url` is a string identifier of the dataset in the format `[owner]/[dataset-name]`.
- The identifier can be read from the link to the dataset: "https://www.kaggle.com/datasets/**muratkokludataset/pistachio-dataset**/data"
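
As a small illustration of the `[owner]/[dataset-name]` format, the identifier can be extracted from a dataset link programmatically. This is only a sketch: `dataset_slug` is a hypothetical helper, not part of `utils.py`, and it assumes the link follows the `/datasets/<owner>/<dataset-name>` layout shown above.

```python
from urllib.parse import urlparse


def dataset_slug(url: str) -> str:
    """Return '[owner]/[dataset-name]' from a Kaggle dataset page URL.

    Hypothetical helper for illustration; assumes the URL path has the
    form /datasets/<owner>/<dataset-name>[/...].
    """
    # Split the URL path into its non-empty segments
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"not a Kaggle dataset URL: {url}")
    return f"{parts[1]}/{parts[2]}"


url = "https://www.kaggle.com/datasets/muratkokludataset/pistachio-dataset/data"
copy_from_url = dataset_slug(url)
print(copy_from_url)  # muratkokludataset/pistachio-dataset
```

The resulting string is the same value that is passed to `download` as `copy_from_url`.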

#### Insights
- There are no missing values.
- There are 2148 samples and 29 features in the dataset.
- There is 1 categorical feature - ['Class']
- There are 28 numerical features - ['Area', 'Perimeter', 'Major_Axis', 'Minor_Axis', 'Eccentricity', 'Eqdiasq', 'Solidity', 'Convex_Area', 'Extent', 'Aspect_Ratio', 'Roundness', 'Compactness', 'Shapefactor_1', 'Shapefactor_2', 'Shapefactor_3', 'Shapefactor_4', 'Mean_RR', 'Mean_RG', 'Mean_RB', 'StdDev_RR', 'StdDev_RG', 'StdDev_RB', 'Skew_RR', 'Skew_RG', 'Skew_RB', 'Kurtosis_RR', 'Kurtosis_RG', 'Kurtosis_RB']
- 50% of the pistachio samples have an area below 79905.5 (the median).
- Only 25% of the pistachio samples have a perimeter above 1607.90625 (the 75th percentile).

In [1]:
# downloading the dataset from the kaggle website
import os 
from utils import download
copy_from_url = "muratkokludataset/pistachio-dataset"
path = os.path.join(os.getcwd(),"data")
download(copy_from_url, path)

### 2.0 Read the dataset using Pandas
- The `read_excel` function from pandas is used to read the downloaded dataset, which is stored as an `.xlsx` file.

In [4]:
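import pandas as pd
from pathlib import Path
# Build the path to the downloaded Excel file
data_file = Path("data") / "Pistachio_Dataset" / "Pistachio_28_Features_Dataset" / "Pistachio_28_Features_Dataset.xlsx"
df = pd.read_excel(data_file)
df  # display the DataFrame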

,Area,Perimeter,Major_Axis,Minor_Axis,Eccentricity,Eqdiasq,Solidity,Convex_Area,Extent,Aspect_Ratio,...,StdDev_RR,StdDev_RG,StdDev_RB,Skew_RR,Skew_RG,Skew_RB,Kurtosis_RR,Kurtosis_RG,Kurtosis_RB,Class
0,63391,1568.4050,390.3396,236.7461,0.7951,284.0984,0.8665,73160,0.6394,1.6488,...,17.7206,19.6024,21.1342,0.4581,0.6635,0.7591,2.9692,3.0576,2.9542,Kirmizi_Pistachio
1,68358,1942.1870,410.8594,234.7525,0.8207,295.0188,0.8765,77991,0.6772,1.7502,...,26.7061,27.2112,25.1035,-0.3847,-0.2713,-0.2927,1.9807,2.1006,2.2152,Kirmizi_Pistachio
2,73589,1246.5380,452.3630,220.5547,0.8731,306.0987,0.9172,80234,0.7127,2.0510,...,19.0129,20.0703,20.7006,-0.6014,-0.4500,0.2998,3.5420,3.6856,4.1012,Kirmizi_Pistachio
3,71106,1445.2610,429.5291,216.0765,0.8643,300.8903,0.9589,74153,0.7028,1.9879,...,18.1773,18.7152,29.7883,-0.6943,-0.6278,-0.7798,2.8776,2.8748,2.8953,Kirmizi_Pistachio
4,80087,1251.5240,469.3783,220.9344,0.8823,319.3273,0.9657,82929,0.7459,2.1245,...,23.4298,24.0878,23.1157,-0.9287,-0.8134,-0.4970,2.9915,2.8813,2.7362,Kirmizi_Pistachio
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2143,85983,1157.1160,444.3447,248.8627,0.8284,330.8730,0.9823,87536,0.6799,1.7855,...,20.8474,20.8118,21.1175,-0.6994,-0.7071,-0.6963,2.8853,2.6599,2.6317,Siirt_Pistachio
2144,85691,2327.3459,439.8794,278.9297,0.7732,330.3107,0.8886,96439,0.6590,1.5770,...,21.2621,22.5004,21.5821,-0.5567,-0.4968,-0.6597,2.3022,2.2664,2.5161,Siirt_Pistachio
2145,101136,1255.6190,475.2161,271.3299,0.8210,358.8459,0.9888,102286,0.7584,1.7514,...,21.1262,20.0279,17.4401,-0.9072,-0.8790,-0.4470,3.3112,3.4306,3.0697,Siirt_Pistachio
2146,97409,1195.2150,452.1823,274.5764,0.7945,352.1718,0.9902,98376,0.7635,1.6468,...,19.3274,19.1782,19.8930,-0.9473,-0.8404,-0.3153,3.4237,2.9606,3.0033,Siirt_Pistachio


In [35]:
df.head()
# isna() returns a boolean DataFrame with the same shape as df: True where a
# value is missing, False otherwise. sum() then counts the True values in each
# column, which gives the number of missing values per feature.
df.isna().sum()

Area             0
Perimeter        0
Major_Axis       0
Minor_Axis       0
Eccentricity     0
Eqdiasq          0
Solidity         0
Convex_Area      0
Extent           0
Aspect_Ratio     0
Roundness        0
Compactness      0
Shapefactor_1    0
Shapefactor_2    0
Shapefactor_3    0
Shapefactor_4    0
Mean_RR          0
Mean_RG          0
Mean_RB          0
StdDev_RR        0
StdDev_RG        0
StdDev_RB        0
Skew_RR          0
Skew_RG          0
Skew_RB          0
Kurtosis_RR      0
Kurtosis_RG      0
Kurtosis_RB      0
Class            0
dtype: int64
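
To see how `isna().sum()` arrives at per-column counts like the ones above, the following sketch runs the same call on a tiny made-up frame. The values are illustrative only and are not taken from the pistachio dataset, which has no missing values.

```python
import pandas as pd

# Toy frame with one missing value per column (illustrative data only)
toy = pd.DataFrame(
    {
        "Area": [63391.0, None, 73589.0],
        "Class": ["Kirmizi_Pistachio", "Siirt_Pistachio", None],
    }
)

mask = toy.isna()              # boolean frame: True where a value is missing
missing_per_column = mask.sum()  # True counts as 1, so this counts missing values
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(f"Total missing values: {total_missing}")
```

A total of zero, as in the dataset used here, means no imputation or row filtering is needed before further analysis.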

### 3.0 Exploring Dataset Characteristics
The following code is useful for obtaining a quick overview of the dataset, including the number of samples, number of features, and a breakdown of categorical and numerical features. It can be particularly helpful during the initial stages of data exploration and analysis.
1. Shape of the DataFrame

```python
samples, features = df.shape
print(f"There are {samples} samples and {features} features in the dataset")
```

2. Identification of Categorical and Numerical Features

```python
categorical_features = [cname for cname in df.columns if df[cname].dtype=='O']
numerical_features = [cname for cname in df.columns if df[cname].dtype!='O']
```
- Two lists, `categorical_features` and `numerical_features`, are created to store the names of the categorical and numerical features, respectively.
- Categorical features are identified by their data type being 'O' (object); every column with any other data type is treated as numerical.
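
An alternative way to make the same split is `DataFrame.select_dtypes`, which selects columns by dtype family instead of comparing against 'O'. The sketch below uses a small made-up frame whose column names are borrowed from the dataset. It may be more robust when text columns are stored with a dedicated string dtype rather than `object`, which depends on the pandas version and how the data was loaded.

```python
import pandas as pd

# Toy frame with two numeric columns and one text column (illustrative data only)
toy = pd.DataFrame(
    {
        "Area": [63391, 68358],
        "Solidity": [0.8665, 0.8765],
        "Class": ["Kirmizi_Pistachio", "Siirt_Pistachio"],
    }
)

# Columns with any numeric dtype (integers and floats)
numerical_features = toy.select_dtypes(include="number").columns.tolist()
# Everything else is treated as categorical here
categorical_features = toy.select_dtypes(exclude="number").columns.tolist()

print(numerical_features)
print(categorical_features)
```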


In [24]:

samples, features = df.shape
print(f"There are {samples} samples and {features} features in the dataset")
categorical_features = [cname for cname in df.columns if df[cname].dtype=='O']
numerical_features = [cname for cname in df.columns if df[cname].dtype!='O']
print(f"There is {len(categorical_features)} categorical features - {categorical_features}")
print(f"There are {len(numerical_features)} numerical features - \
{numerical_features}")


There are 2148 samples and 29 features in the dataset
There is 1 categorical features - ['Class']
There are 28 numerical features - ['Area', 'Perimeter', 'Major_Axis', 'Minor_Axis', 'Eccentricity', 'Eqdiasq', 'Solidity', 'Convex_Area', 'Extent', 'Aspect_Ratio', 'Roundness', 'Compactness', 'Shapefactor_1', 'Shapefactor_2', 'Shapefactor_3', 'Shapefactor_4', 'Mean_RR', 'Mean_RG', 'Mean_RB', 'StdDev_RR', 'StdDev_RG', 'StdDev_RB', 'Skew_RR', 'Skew_RG', 'Skew_RB', 'Kurtosis_RR', 'Kurtosis_RG', 'Kurtosis_RB']


The `describe()` method in pandas generates descriptive statistics of a DataFrame. It summarizes the central tendency, dispersion, and shape of the distribution of each column; by default only numerical columns are included, which is why `Class` does not appear in the output.
The output of `df.describe()` includes the following statistics for each column:

- count: The number of non-null values (missing values are excluded).
- mean: The mean (average) of the values.
- std: The standard deviation, a measure of the amount of variation or dispersion.
- min: The minimum value in the column.
- 25%: The first quartile, or the 25th percentile.
- 50%: The median, or the 50th percentile.
- 75%: The third quartile, or the 75th percentile.
- max: The maximum value in the column.
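
To make these rows concrete, the sketch below recomputes them by hand for a short list of numbers using only the standard library. It assumes the data has no missing values, that the standard deviation is the sample standard deviation, and that quartiles use linear interpolation between neighbouring points (`method="inclusive"` in `statistics.quantiles`), which is my understanding of the pandas defaults. `describe_column` is a hypothetical helper, not a pandas function.

```python
import statistics


def describe_column(values):
    """Hand-rolled summary of one numeric column (illustrative helper)."""
    data = sorted(values)
    # Quartiles with linear interpolation between neighbouring data points
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "count": len(data),
        "mean": statistics.fmean(data),
        "std": statistics.stdev(data),  # sample standard deviation (n - 1)
        "min": data[0],
        "25%": q1,
        "50%": median,
        "75%": q3,
        "max": data[-1],
    }


toy = [2, 4, 4, 5, 7, 9]
summary = describe_column(toy)
for name, value in summary.items():
    print(f"{name:>5}: {value}")
```

For this toy list the quartiles come out as 4.0, 4.5, and 6.5: a quarter of the values lie at or below 4.0, half lie below 4.5, and a quarter lie above 6.5.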

In [36]:
df.describe()

,Area,Perimeter,Major_Axis,Minor_Axis,Eccentricity,Eqdiasq,Solidity,Convex_Area,Extent,Aspect_Ratio,...,Mean_RB,StdDev_RR,StdDev_RG,StdDev_RB,Skew_RR,Skew_RG,Skew_RB,Kurtosis_RR,Kurtosis_RG,Kurtosis_RB
count,2148.0,2148.0,2148.0,2148.0,2148.0,2148.0,2148.0,2148.0,2148.0,2148.0,...,2148.0,2148.0,2148.0,2148.0,2148.0,2148.0,2148.0,2148.0,2148.0,2148.0
mean,79950.655493,1425.971751,446.248968,238.311842,0.840219,317.919173,0.940093,85015.839851,0.716067,1.898154,...,191.995311,21.380084,22.591454,22.427056,-0.735243,-0.61558,-0.367142,3.054,2.903015,2.940572
std,13121.737799,375.565503,32.445304,30.310695,0.048759,26.9086,0.050452,13154.919327,0.052532,0.2401,...,13.030505,3.127813,3.622222,3.926325,0.384584,0.389219,0.426964,0.733993,0.651383,0.750171
min,29808.0,858.363,320.3445,133.5096,0.5049,194.8146,0.588,37935.0,0.4272,1.1585,...,146.7876,10.6111,11.9854,11.1971,-1.9316,-1.6582,-2.3486,1.6624,1.6655,1.5225
25%,71936.75,1170.99625,426.50875,217.875825,0.8175,302.64285,0.91985,76467.0,0.687,1.736375,...,182.930675,19.25355,20.036675,19.722425,-0.9909,-0.875975,-0.6458,2.5097,2.4374,2.449425
50%,79905.5,1262.7855,448.57475,236.41635,0.84965,318.9653,0.95415,85075.5,0.7265,1.89625,...,192.03635,21.4251,22.52325,22.2769,-0.7566,-0.65305,-0.42455,2.94175,2.80705,2.78335
75%,89030.5,1607.90625,468.5094,257.76015,0.8752,336.685525,0.976925,93893.5,0.7536,2.067025,...,201.097725,23.6959,25.2419,25.140125,-0.5025,-0.405,-0.1584,3.4465,3.2474,3.22465
max,124008.0,2755.0491,541.9661,383.0461,0.946,397.3561,0.9951,132478.0,0.8204,3.0858,...,235.0007,30.8383,33.6146,42.7566,1.8654,2.2576,1.8521,8.8906,10.4539,11.5339
