<a href="https://colab.research.google.com/github/guilhermelaviola/IntelligentCommunication/blob/main/Class06.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Data Collection and Storage Techniques**
Integration with external data capture systems, such as third-party APIs, IoT sensors, and social media platforms, is becoming increasingly common. It lets companies extract and process information directly within their internal systems, making it more accessible for analysis and decision-making. However, external data often arrives in unpredictable formats and can affect system performance, so both security and performance must be taken into account.

Efficient data storage is crucial for systems that process large amounts of data. The choice of storage depends on factors such as data type, volume, system scalability, and the need for fast access. In practice, this means using databases that can handle large volumes without compromising performance, and combining relational and non-relational databases to suit different data types. Data compression reduces storage space without compromising integrity, and database indexes keep queries fast by avoiding a scan of every record.

Data cleansing and validation are essential to prevent noise, errors, and duplicate data. Cleansing involves standardizing input formats, removing duplicates, and correcting errors. Validation ensures that data meets expected standards before it is saved to the database. It should be performed at multiple layers, both to prevent users from entering incorrect data and to ensure that data from external sources or APIs meets system requirements.

Organizations are managing increasingly large volumes of data because of the expansion of the internet, the proliferation of IoT, and digitalization. Distributed storage and processing, using technologies like Hadoop and Apache Spark, are key approaches for handling these volumes effectively.
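
To make the validation and indexing points concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The validation rules, the `employees` table, the `idx_employees_name` index, and the sample records are all assumptions made for illustration; a real system would define its own rules and would likely validate at more than one layer.

```python
import sqlite3

def is_valid(record: dict) -> bool:
    """Return True when a record meets the (assumed) expected standards."""
    return (
        isinstance(record.get("name"), str) and record["name"].strip() != ""
        and isinstance(record.get("age"), int) and 0 < record["age"] < 120
        and isinstance(record.get("salary"), (int, float)) and record["salary"] >= 0
    )

# Hypothetical records, as they might arrive from an external source:
incoming = [
    {"name": "Alice", "age": 30, "salary": 52000.0},
    {"name": "", "age": 41, "salary": 61000.0},     # rejected: empty name
    {"name": "Bob", "age": -5, "salary": 48000.0},  # rejected: impossible age
    {"name": "Eva", "age": 29, "salary": 83180.36},
]

# Validate before saving, so that bad records never reach the database:
valid = [r for r in incoming if is_valid(r)]

# Store the valid records in an in-memory database:
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "id INTEGER PRIMARY KEY, name TEXT, age INTEGER, salary REAL)"
)
conn.executemany(
    "INSERT INTO employees (name, age, salary) VALUES (:name, :age, :salary)",
    valid,
)

# An index on 'name' lets lookups by name avoid scanning every row:
conn.execute("CREATE INDEX idx_employees_name ON employees (name)")

# Asking SQLite how it would run the query shows whether the index is used:
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM employees WHERE name = ?", ("Eva",)
).fetchall()
for row in plan:
    print(row)
```

The query plan is printed rather than asserted here because its exact wording varies between SQLite versions; what to look for is a mention of `idx_employees_name`.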

In [1]:
# Importing all the necessary libraries and resources:
import pandas as pd
import numpy as np

In [2]:
# Defining a seed for reproducibility:
np.random.seed(42)

# Simulating data:
n = 100
names = ['Alice', 'Bob', 'Charlie', 'David', 'Eva']
data = {
    'id': range(1, n + 1),
    'name': np.random.choice(names, n),
    'age': np.random.randint(22, 60, n),
    'salary': np.round(np.random.uniform(30000, 120000, n), 2),
    'departament': np.random.choice(['Sales', 'IT', 'RH', 'Finance'], n),
}

In [3]:
# Creating a DataFrame:
df = pd.DataFrame(data)

# Introducing a null value and a duplicated row for simulation:
df.loc[5, 'salary'] = np.nan
df = pd.concat([df, df.iloc[[3]]])

# Displaying the simulated data:
print('Simulated data:')
print(df.head())

Simulated data:
   id     name  age    salary departament
0   1    David   49  58870.21       Sales
1   2      Eva   28  46786.67       Sales
2   3  Charlie   30  33669.76          RH
3   4      Eva   29  83180.36          IT
4   5      Eva   33  90980.79     Finance
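
Because the duplicated row is appended with `pd.concat`, it keeps its original index label, so the frame ends up with a repeated label. The following is a minimal sketch, on a small stand-in frame with made-up values, of how the duplicate could be inspected before dropping it and how the index could be rebuilt afterwards. The variable names `small`, `dup_mask`, and `cleaned` are illustrative only.

```python
import numpy as np
import pandas as pd

# Small stand-in frame (hypothetical values), built the same way as above:
small = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Eva'],
    'salary': [100.0, np.nan, 300.0],
})
small = pd.concat([small, small.iloc[[0]]])  # the copy keeps index label 0

# keep=False marks every member of a duplicated group, not only the repeats:
dup_mask = small.duplicated(keep=False)
print(small[dup_mask])

# Counting missing salaries before deciding how to fill them:
print(small['salary'].isna().sum())

# Dropping the repeats and rebuilding a clean 0..n-1 index:
cleaned = small.drop_duplicates().reset_index(drop=True)
print(cleaned)
```

Resetting the index is optional; it only matters when later code relies on unique index labels.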


In [4]:
# Data treatment and cleaning:
# Checking null data:
print('\nChecking null values:')
print(df.isnull().sum())

# Filling null salaries with the column mean:
df['salary'] = df['salary'].fillna(df['salary'].mean())

# Removing duplicates:
df.drop_duplicates(inplace=True)

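# Checking the data after the treatment
# (df.info() prints its report directly and returns None, so 'None' appears at the end of the output):
print('\nData after treatment:')
print(df.info())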

# Simple analysis:
avg_salary_by_departament = df.groupby('departament')['salary'].mean()
print('\nAverage salary by department:')
print(avg_salary_by_departament)

# Filtering data:
employees_above_50k = df[df['salary'] > 50000]
print('\nEmployees with salary above 50k:')
print(employees_above_50k)


Checking null values:
id             0
name           0
age            0
salary         1
departament    0
dtype: int64

Data after treatment:
<class 'pandas.core.frame.DataFrame'>
Index: 100 entries, 0 to 99
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           100 non-null    int64  
 1   name         100 non-null    object 
 2   age          100 non-null    int64  
 3   salary       100 non-null    float64
 4   departament  100 non-null    object 
dtypes: float64(1), int64(2), object(2)
memory usage: 4.7+ KB
None

Average salary by department:
departament
Finance    72704.478148
IT         83793.250714
RH         79217.790063
Sales      71819.087619
Name: salary, dtype: float64

Employees with salary above 50k:
     id     name  age       salary departament
0     1    David   49   58870.2100       Sales
3     4      Eva   29   83180.3600          IT
4     5      Eva   33   90980.7900     Finance
5     6