### Connecting Google Drive to Colab

In [43]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
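
The message above appears whenever the cell is re-run in a session where Drive is already mounted. A minimal sketch of a guard that mounts only when needed is shown below. The helper names `drive_is_mounted` and `mount_drive` are hypothetical, and the check assumes a mounted Drive exposes a `MyDrive` folder, as in the paths used in this notebook.

```python
import os


def drive_is_mounted(mount_point="/content/drive"):
    """Return True if the Drive folder is already visible under mount_point."""
    # Assumption: a mounted Drive exposes a "MyDrive" folder,
    # which matches the paths used elsewhere in this notebook.
    return os.path.isdir(os.path.join(mount_point, "MyDrive"))


def mount_drive(mount_point="/content/drive"):
    """Mount Google Drive only when it is not mounted yet (Colab only)."""
    if drive_is_mounted(mount_point):
        return False
    # Imported lazily because google.colab exists only inside Colab.
    from google.colab import drive
    drive.mount(mount_point)
    return True
```

Calling `mount_drive()` in place of the cell above would return `False` without touching the mount when Drive is already available.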


### Reading the Dataset from a Google Drive folder

In [44]:
!unzip "/content/drive/MyDrive/Datasets/Monkeypox/latest.csv.zip"

Archive:  /content/drive/MyDrive/Datasets/Monkeypox/latest.csv.zip
  inflating: latest.csv              
replace __MACOSX/._latest.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: N
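
The `replace ... ?` prompt above comes from the `__MACOSX` metadata folder that macOS adds to zip archives. One way to avoid the interactive prompt is to extract with Python's standard `zipfile` module and skip those entries. This is a sketch, not the notebook's original method; the helper name `extract_without_macosx` is hypothetical.

```python
import zipfile


def extract_without_macosx(zip_path, dest_dir):
    """Extract a zip archive, skipping macOS metadata entries.

    Returns the list of member names that were extracted.
    """
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            # Skip the __MACOSX/ folder that macOS adds to archives.
            if member.startswith("__MACOSX/"):
                continue
            archive.extract(member, dest_dir)
            extracted.append(member)
    return extracted
```

In this notebook it could be called as `extract_without_macosx("/content/drive/MyDrive/Datasets/Monkeypox/latest.csv.zip", "/content")`, where the destination `/content` is an assumption about where the file should land.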


### Setting up PySpark on Google Colab

##### Installing the dependencies

In [45]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
!tar xf spark-2.4.4-bin-hadoop2.7.tgz
!pip install -q findspark

##### Setting the environment variables

In [46]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"

# make pyspark importable; pass the absolute SPARK_HOME path so this
# does not depend on the current working directory
import findspark
findspark.init(os.environ["SPARK_HOME"])
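
If the download or extraction step failed silently, the variables above point to folders that do not exist and Spark fails later with a less obvious error. The sketch below checks the paths up front. The helper name `missing_env_dirs` is hypothetical and not part of findspark or PySpark.

```python
import os


def missing_env_dirs(var_names, environ=None):
    """Return the variables that are unset or do not point to a directory."""
    environ = os.environ if environ is None else environ
    missing = []
    for name in var_names:
        path = environ.get(name)
        if not path or not os.path.isdir(path):
            missing.append(name)
    return missing


# In the notebook this would run right after JAVA_HOME and SPARK_HOME are set.
problems = missing_env_dirs(["JAVA_HOME", "SPARK_HOME"])
if problems:
    print("Check these variables before starting Spark:", problems)
```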

##### Importing and creating the SparkSession

In [47]:
from pyspark.sql import SparkSession

spark = SparkSession.builder\
        .master("local")\
        .appName("monkeypoxcolab")\
        .config('spark.ui.port', '4050')\
        .getOrCreate()

In [None]:
spark

##### Importing libraries

In [48]:
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql import functions as f
from IPython.core.display import HTML

### Loading the data into PySpark

In [49]:
monkeypoxdf = spark.read.csv("/content/drive/MyDrive/Datasets/latest.csv", sep=",", header=True, inferSchema=True)

##### Showing the details of the DataFrame columns

In [50]:
monkeypoxdf.printSchema()

root
 |-- ID: string (nullable = true)
 |-- Status: string (nullable = true)
 |-- Location: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- Country_ISO3: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Date_onset: timestamp (nullable = true)
 |-- Date_confirmation: timestamp (nullable = true)
 |-- Symptoms: string (nullable = true)
 |-- Hospitalised (Y/N/NA): string (nullable = true)
 |-- Date_hospitalisation: timestamp (nullable = true)
 |-- Isolated (Y/N/NA): string (nullable = true)
 |-- Date_isolation: timestamp (nullable = true)
 |-- Outcome: string (nullable = true)
 |-- Contact_comment: string (nullable = true)
 |-- Contact_ID: integer (nullable = true)
 |-- Contact_location: string (nullable = true)
 |-- Travel_history (Y/N/NA): string (nullable = true)
 |-- Travel_history_entry: string (nullable = true)
 |-- Travel_history_start: string (nullable = true)
 |-- Travel_

#### Number of rows in the DataFrame
##### *Total number of cases tracked worldwide, across every status (confirmed, discarded, suspected, and omit_error)*

In [51]:
monkeypoxdf.count()

49289

In [52]:
monkeypoxdf.select("Status").distinct().show()

+----------+
|    Status|
+----------+
|omit_error|
|      null|
| suspected|
| confirmed|
| discarded|
+----------+



#### Describing the columns

In [53]:
monkeypoxdf.describe().show()

+-------+--------------------+---------+-------------+-------------+-------+------------+-----+------+--------------------+---------------------+-----------------+---------+--------------------+-----------------+--------------------+-----------------------+--------------------+--------------------+-----------------------+----------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------+---------+----------+
|summary|                  ID|   Status|     Location|         City|Country|Country_ISO3|  Age|Gender|            Symptoms|Hospitalised (Y/N/NA)|Isolated (Y/N/NA)|  Outcome|     Contact_comment|       Contact_ID|    Contact_location|Travel_history (Y/N/NA)|Travel_history_entry|Travel_history_start|Travel_history_location|Travel_history_country|   Genomics_Metadata| Confirmation_method|              Source|           Source_II|          Source_III|           Source_IV|Source_V|Source_VI|Source

#### Ranking of the 10 countries with the highest number of cases (through 22-08-2022)

In [79]:
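# show() returns None, so keep the aggregated DataFrame in its own variable and display it separately
monkeypoxdf_total = monkeypoxdf.groupBy("Country").agg(count("Country").alias("total_casos")).orderBy(col("total_casos").desc())
monkeypoxdf_total.show(10)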

+--------------------+-----------+
|             Country|total_casos|
+--------------------+-----------+
|       United States|      16759|
|               Spain|       6470|
|              Brazil|       4186|
|             Germany|       3350|
|             England|       3191|
|              France|       2899|
|Democratic Republ...|       2380|
|              Canada|       1298|
|                Peru|       1210|
|         Netherlands|       1090|
+--------------------+-----------+
only showing top 10 rows



#### Symptoms found among people infected with the virus

In [85]:
monkeypoxdf.select("Symptoms").distinct().count()

99

In [99]:
monkeypoxdf.select("Symptoms").distinct().show(99)

+--------------------+
|            Symptoms|
+--------------------+
|headache, skin le...|
|Perianal rash, fever|
|Fever, chills, fa...|
|headache, muscle ...|
|fever, outbreak o...|
|                rash|
|rash, body pains,...|
|             lesions|
|genital ulcer les...|
|characteristic sy...|
|      fever, lesions|
| skin manifestations|
|headache, muscle ...|
|symptoms compatib...|
|                Rash|
|      fever; myalgia|
|Spots on skin, ve...|
|         skin rashes|
|                null|
|  ulcerative lesions|
|fever, general di...|
|  myalgias, postules|
|muscle aches, fev...|
|blisters, high fever|
|Fatigue, headache...|
|skin lesions, ulc...|
|oral and genital ...|
|Malaise, headache...|
| fever, skin lesions|
|headache, fever, ...|
|            Vesicles|
|Pain urinating, f...|
|fatigue, fever, s...|
|skin lesions, hea...|
|fever, headache, ...|
|fever, cough, ski...|
|Perianal rash, fe...|
|         Rash, fever|
|skin lesions, fev...|
|fever, headache, ...|
|fatigue, s

In [97]:
# convert the DataFrame column to pandas for easier viewing of the data
monkeypoxdf.select("Symptoms").distinct().toPandas()

Unnamed: 0,Symptoms
0,"headache, skin lesions"
1,"Perianal rash, fever"
2,"Fever, chills, fatigue, headache, skin lesions"
3,"headache, muscle pain, back pain, vasicular ra..."
4,"fever, outbreak on the skin, hands, and chest"
...,...
94,"general weakness, fever, skin rashes"
95,genital ulcers
96,"Fever, skin rashes"
97,"Rash, muscle ache, fatigue"
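
The listing above shows that `Symptoms` is free text: the same symptom appears with different capitalization (`Rash`, `rash`), and entries are separated by commas or, in places, semicolons (`fever; myalgia`). Many of the 99 distinct values therefore describe overlapping symptoms. The plain-Python sketch below illustrates one possible normalization; the helper name `split_symptoms` is hypothetical. It is not the notebook's method, and phrases that contain commas themselves (such as "outbreak on the skin, hands, and chest") would be split incorrectly.

```python
import re


def split_symptoms(raw):
    """Split a free-text symptom string into normalized, de-duplicated terms."""
    if raw is None:
        return []
    terms = []
    # Entries use both commas and semicolons as separators.
    for part in re.split(r"[,;]", raw):
        term = part.strip().lower()
        if term and term not in terms:
            terms.append(term)
    return terms


print(split_symptoms("Fever, chills, fatigue, headache, skin lesions"))
# ['fever', 'chills', 'fatigue', 'headache', 'skin lesions']
print(split_symptoms("fever; myalgia"))
# ['fever', 'myalgia']
print(split_symptoms("Rash"))
# ['rash']
```

The same idea can be expressed on the Spark side with the `split`, `lower`, `trim`, and `explode` functions from `pyspark.sql.functions`.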


In [100]:
monkeypoxdf.select("Symptoms","Gender").distinct().toPandas()

Unnamed: 0,Symptoms,Gender
0,lesions,male
1,"headache, skin lesions",
2,ulcerative lesions,male
3,Rash,Male
4,"fever, skin lesions",male
...,...,...
106,"Swelling, fever, rash, diarrhea",Male
107,"lower abdomen skin lesions, fatigue, swollen l...",male
108,"Perianal rash, fever",Male
109,"rash, body pains, fever",male


##### Applying the split function to the symptom strings