# Volcanoes on Venus

In [2]:
import pandas as pd
import plotly.express as px


## Labels Dataset

In [25]:
labels_df = pd.read_csv('/home/sesso/Documents/Study/pytorch_learn/volcanoes_venus/venus_data/archive/volcanoes_train/train_labels.csv')
# Keep only the images labelled as containing a volcano.
# .copy() gives an independent frame, so adding a column does not raise SettingWithCopyWarning.
labels_df_volc = labels_df[labels_df['Volcano?'] > 0].copy()
# String version of Type, used as a categorical label in the plots (astype returns a new Series, so assign it).
labels_df_volc['type_cat'] = labels_df_volc['Type'].astype('str')

labels_df.head()



Unnamed: 0,Volcano?,Type,Radius,Number Volcanoes
0,1,3.0,17.46,1.0
1,0,,,
2,0,,,
3,0,,,
4,0,,,


In [36]:

print(f'Shape: {labels_df.shape}')
print('--- Missing Values ---')
print(labels_df.isna().sum())

x = len(labels_df_volc)/len(labels_df)
y = 1 - x

print('--- Imbalancedness ---')
print('Volcanoes :',x*100,'%')
print('non Volcanoes :',y*100,'%')

Shape: (7000, 4)
--- Missing Values ---
Volcano?               0
Type                6000
Radius              6000
Number Volcanoes    6000
dtype: int64
--- Imbalancedness ---
Volcanoes : 14.285714285714285 %
non Volcanoes : 85.71428571428572 %
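
The same percentages can be read off in one step with `value_counts(normalize=True)`. The sketch below uses a small made-up frame (`toy_labels`, a hypothetical stand-in for `labels_df`) so that it runs without the CSV file.

```python
import pandas as pd

# Toy stand-in for labels_df: 1 volcano image out of 7 (hypothetical data).
toy_labels = pd.DataFrame({"Volcano?": [1, 0, 0, 0, 0, 0, 0]})

# Fraction of rows per class, expressed as percentages.
class_share = toy_labels["Volcano?"].value_counts(normalize=True) * 100
print(class_share.round(2))
```

On the real data, replacing `toy_labels` with `labels_df` should reproduce the two percentages printed above.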


Dataset columns:

- Volcano?: whether the image contains at least one volcano (main target), 1 or 0.

For Volcano?=0, the next three features are NaN.

- Type:

  - 1 = definitely a volcano
  - 2 = probably
  - 3 = possibly
  - 4 = only a pit is visible

- Radius: the radius of the volcano at the center of the image, in pixels.

- Number Volcanoes: the number of volcanoes in the image.
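
The column notes above can be checked and made more readable with a few lines of pandas. This is a minimal sketch on made-up rows: `toy_labels`, `detail_cols` and `TYPE_NAMES` are hypothetical names introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for labels_df (hypothetical values, same column names).
toy_labels = pd.DataFrame({
    "Volcano?": [1, 0, 1, 0],
    "Type": [3.0, np.nan, 1.0, np.nan],
    "Radius": [17.46, np.nan, 8.0, np.nan],
    "Number Volcanoes": [1.0, np.nan, 2.0, np.nan],
})

detail_cols = ["Type", "Radius", "Number Volcanoes"]
no_volcano = toy_labels["Volcano?"] == 0

# Rows without a volcano should have all three detail columns missing...
nan_only_when_absent = bool(toy_labels.loc[no_volcano, detail_cols].isna().all().all())
# ...and rows with a volcano should have none missing.
complete_when_present = bool(toy_labels.loc[~no_volcano, detail_cols].notna().all().all())
print(nan_only_when_absent, complete_when_present)

# Human-readable names for the Type codes (keys are floats because Type is stored as float).
TYPE_NAMES = {1.0: "definitely", 2.0: "probably", 3.0: "possibly", 4.0: "only a pit is visible"}
toy_labels["type_name"] = toy_labels["Type"].map(TYPE_NAMES)
print(toy_labels[["Type", "type_name"]])
```

Rows with a missing Type keep a missing `type_name`, so the mapping does not hide the NaN pattern.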

### Plots and tables

#### Type

As seen below, there is a lot of uncertainty in what is defined as a volcano. Roughly speaking, 83% of the detected volcanoes fall under "possibly" or "only a pit is visible".


In [22]:
fig = px.pie(labels_df_volc,  values='Type', names='type_cat')
fig.show()
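
One caveat when reading the pie chart: with `values='Type'`, each sector is the sum of the Type codes in that category, so a type-4 row weighs four times as much as a type-1 row. The sketch below compares that with a plain row count. `toy_volc` is a hypothetical stand-in for `labels_df_volc`, with made-up values.

```python
import pandas as pd

# Toy stand-in for labels_df_volc (hypothetical values).
toy_volc = pd.DataFrame({"Type": [1.0, 4.0, 4.0, 3.0]})

# Share of images per type: each row counts once.
share_by_count = toy_volc["Type"].value_counts(normalize=True).sort_index()

# Share obtained by summing the Type codes within each type,
# which mirrors a pie chart built with values='Type'.
sum_by_type = toy_volc.groupby("Type")["Type"].sum()
share_by_sum = sum_by_type / sum_by_type.sum()

print(share_by_count)
print(share_by_sum)
```

In this toy example, type 4 covers half of the rows but two thirds of the summed values. Running `labels_df_volc['Type'].value_counts(normalize=True)` on the real data is one way to check the per-image shares.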

#### Radius

The figure below plots the distribution of Radius.

In [23]:
fig = px.histogram(labels_df_volc, x="Radius")
fig.show()

#### Number Volcanoes

The figure below plots the distribution of Number Volcanoes.

In [24]:
fig = px.histogram(labels_df_volc, x="Number Volcanoes")
fig.show()

## Image dataset


In [45]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [38]:
image_df = pd.read_csv('/home/sesso/Documents/Study/pytorch_learn/volcanoes_venus/venus_data/archive/volcanoes_train/train_images.csv',
                        header = None)

image_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,12090,12091,12092,12093,12094,12095,12096,12097,12098,12099
0,95,101,99,103,95,86,96,89,70,104,...,111,107,92,89,103,99,117,116,118,96
1,91,92,91,89,92,93,96,101,107,104,...,103,92,93,95,98,105,104,100,90,81
2,87,70,72,74,84,78,93,104,106,106,...,84,71,95,102,94,80,91,80,84,90
3,0,0,0,0,0,0,0,0,0,0,...,94,81,89,84,80,90,92,80,88,96
4,114,118,124,119,95,118,105,116,123,112,...,116,113,102,93,109,104,106,117,111,115


In [40]:
print(f'Shape: {image_df.shape}')
print('--- Missing Values ---')
print(image_df.isna().sum())
print('--- Missing Values (total) ---')
print(image_df.isna().sum().sum())
# x = len(image_df_volc)/len(image_df)
# y = 1 - x

# print('--- Imbalancedness ---')
# print('Volcanoes :',x*100,'%')
# print('non Volcanoes :',y*100,'%')

Shape: (7000, 12100)
--- Missing Values ---
0        0
1        0
2        0
3        0
4        0
        ..
12095    0
12096    0
12097    0
12098    0
12099    0
Length: 12100, dtype: int64
--- Missing Values (total) ---
0
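
Each row holds 12100 pixel values, and 12100 = 110 × 110, so a row can be turned back into a square image. This is a minimal sketch that assumes the pixels were flattened in row-major order; `toy_row` is a hypothetical stand-in for one row of `image_df`.

```python
import numpy as np

# 12100 pixel values per row; 12100 = 110 * 110, so a square image is assumed.
SIDE = 110

# Toy stand-in for one row of image_df (hypothetical pixel values in 0..255).
toy_row = np.arange(SIDE * SIDE) % 256

# Row-major reshape (an assumption about how the images were flattened).
img = toy_row.reshape(SIDE, SIDE)
print(img.shape)
```

With the real data, `image_df.iloc[0].to_numpy().reshape(110, 110)` could be passed to `plt.imshow(..., cmap='gray')` to display the first image.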
