## Hadoop: Writing and Reading Data with HDFS

This notebook demonstrates how to interact with the Hadoop Distributed File System (HDFS) from Python. It covers the two fundamental operations: writing data to HDFS and reading it back.


#### 1. Install Python packages

After selecting the appropriate kernel and creating a Python virtual environment (e.g., named .venv), install the following packages:

`hdfs` – A Python library used for interacting with the Hadoop Distributed File System (HDFS). It provides functionality to connect to HDFS clusters, read and write files, and manage HDFS directories.

`pandas` – A popular Python library for data manipulation and analysis. It provides powerful data structures like DataFrame and a wide range of functions to work with structured data.

In [None]:
! pip install hdfs pandas

#### 2. Connect to HDFS

Create a client object for the NameNode's web endpoint and check that the cluster is reachable.

In [None]:
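from hdfs import InsecureClient

try:
    # Creating the client does not contact the cluster yet.
    client = InsecureClient('http://namenode:9870', user='root')
    # Request the status of the root directory to force a real round trip.
    client.status('/')
    print("Connected!")
except Exception as e:
    print(f"Connection failed: {e}")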

#### 3. Check HDFS folders and files

In [None]:
files = client.list('/')
print(files)
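
The listing above returns bare names only. `client.list` also accepts `status=True`, in which case each entry comes back paired with its file status dictionary. The following is a minimal sketch under that assumption; `describe_dir` is a hypothetical helper name, not part of the `hdfs` library, and it relies on the `type` and `length` fields that WebHDFS normally reports.

```python
def describe_dir(client, hdfs_path='/'):
    """Return (name, type, size_in_bytes) for each entry under hdfs_path.

    Hypothetical helper: `client` is expected to behave like the
    InsecureClient created earlier in this notebook.
    """
    # With status=True, list() yields (name, status_dict) pairs.
    entries = client.list(hdfs_path, status=True)
    return [(name, info['type'], info['length']) for name, info in entries]
```

In the notebook it could be called as `describe_dir(client, '/')` to tell files from directories at a glance.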

#### 4. Download the sample data file to your local machine

In [None]:
import urllib.request

url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
urllib.request.urlretrieve(url, 'sample_data.csv')
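
Before copying the file to HDFS, it can help to confirm that the download really is a CSV and not an error page. A small sketch using only the standard library; `preview_csv` is a hypothetical helper name introduced here for illustration.

```python
import csv


def preview_csv(local_path, n_rows=3):
    """Return the header row and the first n_rows data rows of a local CSV file."""
    with open(local_path, newline='', encoding='utf-8') as f:
        reader = csv.reader(f)
        header = next(reader)
        # zip() stops once range(n_rows) is exhausted, so at most n_rows are read.
        rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows
```

Calling `preview_csv('sample_data.csv')` in the next cell should show the column names followed by a few passenger records.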

#### 5. Copy the file to HDFS

After the upload completes, verify the result in the NameNode web UI at [localhost:9870](http://localhost:9870/explorer.html#/).

In [None]:
# A relative hdfs_path resolves against the user's HDFS home directory (/user/root here).
client.upload(hdfs_path='data.csv', local_path='sample_data.csv', overwrite=True, permission=775)
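
The upload can also be checked from Python instead of the browser. The sketch below compares the byte length that HDFS reports with the size of the local file; `upload_matches_local` is a hypothetical helper name, and it assumes `client.status(..., strict=False)` returns `None` for a missing path, as the `hdfs` library documents.

```python
import os


def upload_matches_local(client, hdfs_path, local_path):
    """Return True if hdfs_path exists and has the same byte length as local_path."""
    # strict=False makes status() return None instead of raising for a missing path.
    status = client.status(hdfs_path, strict=False)
    return status is not None and status['length'] == os.path.getsize(local_path)
```

For this notebook the call would be `upload_matches_local(client, '/user/root/data.csv', 'sample_data.csv')`. Matching sizes are a quick sanity check, not a checksum comparison.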

#### 6. Read data from HDFS

Read the uploaded file from HDFS into a pandas DataFrame.

In [None]:
import pandas as pd

with client.read(hdfs_path='/user/root/data.csv', encoding='utf8') as file:
    df = pd.read_csv(file)

df
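
The reverse direction, writing a DataFrame straight to HDFS without a local file, can be sketched with `client.write`, which the `hdfs` library exposes as a context manager yielding a file-like writer. `save_frame_to_hdfs` is a hypothetical helper name, and the sketch assumes the writer is accepted by `DataFrame.to_csv`.

```python
def save_frame_to_hdfs(client, frame, hdfs_path):
    """Write a DataFrame to hdfs_path as CSV, replacing any existing file.

    Hypothetical helper: `frame` only needs a pandas-style to_csv method.
    """
    with client.write(hdfs_path, encoding='utf-8', overwrite=True) as writer:
        # index=False keeps the pandas row index out of the CSV output.
        frame.to_csv(writer, index=False)
```

For example, `save_frame_to_hdfs(client, df, '/user/root/data_copy.csv')` would store a copy of the DataFrame read above; the target path here is illustrative only.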