# **PROBLEM STATEMENT**
### **Overview:**
Your task is to secure the Pandas library and its dependencies in a
Python environment, ensuring safe operation particularly in restricted or "jailed"
setups. You will identify potential security risks involving unsafe system operations
and mitigate them.
### **Background:**
Pandas is extensively used for data manipulation and analysis. In
environments where security is paramount, especially web-based applications, it is
crucial to restrict the library's capabilities to prevent unauthorized system access
(including data access).
### **Challenge Details:**
1. Dependency Analysis: Identify and list all dependencies of the Pandas
library. Narrow down these dependencies to a subset that could potentially
include unsafe operations related to system access (e.g., sys, os, requests
modules).
2. Code Analysis: Analyze the source code of Pandas and its critical
dependencies. Identify all functions and methods in these packages that
could invoke unsafe operations. Map out a call graph for Pandas and the
selected dependencies to see which high-level functions can lead to these
unsafe operations.
3. Pruning Unsafe Features: Propose methods to modify or restrict access to
these unsafe functions. Ensure that core functionalities of Pandas and its
dependencies remain intact and usable while enhancing security.
4. Documentation: Document your findings, including a list of dependencies
analyzed, identified risky functions, and your proposed solutions. Explain
your approach and the steps taken to secure the library and its
dependencies, providing code snippets and diagrams where applicable.

### Task 1: **Dependency analysis**
Pandas relies on a set of required and optional dependencies. The following lists the dependencies that pandas reports for this environment, together with their installed versions.

In [None]:
import pandas as pd
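# show_versions() prints its report directly and returns None,
# so wrapping it in print() would append a stray "None".
pd.show_versions()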




INSTALLED VERSIONS
------------------
commit           : 0f437949513225922d851e9581723d82120684a6
python           : 3.10.12.final.0
python-bits      : 64
OS               : Linux
OS-release       : 6.1.85+
Version          : #1 SMP PREEMPT_DYNAMIC Sun Apr 28 14:29:16 UTC 2024
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : en_US.UTF-8
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 2.0.3
numpy            : 1.25.2
pytz             : 2023.4
dateutil         : 2.8.2
setuptools       : 67.7.2
pip              : 23.1.2
Cython           : 3.0.10
pytest           : 7.4.4
hypothesis       : None
sphinx           : 5.0.2
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.9.4
html5lib         : 1.1
pymysql          : None
psycopg2         : 2.9.9
jinja2           : 3.1.4
IPython          : 7.34.0
pandas_datareader: 0.10.0
bs4              : 4.12.3
bottleneck       : None

Of these, the following dependencies are selected for our analysis of unsafe modules.

#### Libraries selected are:

*   **Numpy**, because pandas relies on it for data manipulation
*   **bs4**, since this module parses external content
*   **pymysql** & **sqlalchemy**, since these connect to databases hosted locally or publicly
*   **fsspec**, since this library performs filesystem operations
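
As a cross-check on the selection above, the requirements a package declares can be read from its installed metadata rather than from `show_versions()`. The following is a minimal sketch, assuming Python 3.8+ (`importlib.metadata`); the helper names `declared_requirements` and `split_requirements` are illustrative, not part of pandas.

```python
from importlib import metadata


def declared_requirements(dist_name):
    """Return the requirement strings a distribution declares, or [] if unknown."""
    try:
        # metadata.requires() returns None when nothing is declared
        return metadata.requires(dist_name) or []
    except metadata.PackageNotFoundError:
        return []


def split_requirements(requirement_strings):
    """Separate core requirements from those gated behind an 'extra' marker."""
    core, optional = [], []
    for req in requirement_strings:
        name_part, _, marker = req.partition(";")
        target = optional if "extra" in marker else core
        target.append(name_part.strip())
    return core, optional


if __name__ == "__main__":
    core, optional = split_requirements(declared_requirements("pandas"))
    print("core:", core)
    print("optional:", optional)
```

Requirements carrying an `extra == ...` marker are the optional ones; everything else is installed with pandas itself.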

#### Unsafe Operations:
Unsafe operations in Python cover a wide range of functions and methods that either raise security concerns when used improperly or can disrupt normal program execution. In this assignment, we concentrate on the latter and focus on the low-level modules that could be used for such operations.

##### **System Operations :**
  * os :
        'system', 'popen', 'spawn', 'execv', 'execl'
  * sys :
        'exit', 'setrecursionlimit', 'settrace'
  * subprocess:
        'call', 'Popen', 'run'
  * pickle :
        'load', 'loads', 'Unpickler'
  * shutil :
        'rmtree'

##### **Web Access :**
Since the problem statement mentions web-based applications, these modules will also be analyzed for unsafe operations:
  * socket :
        'socket', 'create_connection', 'getaddrinfo', 'gethostbyname',
        'gethostbyname_ex', 'gethostbyaddr', 'create_server', 'fromfd'
  * requests :
          'get', 'post', 'put', 'delete', 'head', 'options', 'patch', 'Session'
  * ftplib
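
Because the scan later matches these names exactly, it is worth checking that each listed name really exists in its module. A small sketch (the helper `missing_names` is hypothetical): the `os` module has no function literally called `spawn`; the family is `spawnl`, `spawnv`, `spawnlp`, and so on, so an exact match on `'spawn'` would not catch them.

```python
import importlib


def missing_names(unsafe_modules):
    """Return {module: [names]} for listed names the module does not define."""
    missing = {}
    for module_name, names in unsafe_modules.items():
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            # The module itself is not installed in this environment
            missing[module_name] = list(names)
            continue
        absent = [name for name in names if not hasattr(module, name)]
        if absent:
            missing[module_name] = absent
    return missing


if __name__ == "__main__":
    print(missing_names({
        "os": ["system", "popen", "spawn", "execv", "execl"],
        "pickle": ["load", "loads", "Unpickler"],
    }))
```

Names reported here could be replaced by the concrete variants, or matched by prefix instead of by exact name.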

  

### Task 2: **Code Analysis**
The code analysis examines how the potentially unsafe operations can be reached from the pandas library and its dependencies.

**Method 1:**

The main pandas library and its dependencies can be searched by walking each file's syntax tree with the `ast` module to find the limited set of harmful operations listed above.

In [None]:
import ast
def analyze_file(filepath):
    with open(filepath, 'r', encoding='utf-8') as file:
        try:
            tree = ast.parse(file.read(), filename=filepath)
        except SyntaxError:
            return []

    unsafe_calls = []

    class UnsafeCallVisitor(ast.NodeVisitor):
        def visit_Import(self, node):
            for alias in node.names:
                if alias.name in UNSAFE_MODULES:
                    unsafe_calls.append((alias.name, node.lineno))
            self.generic_visit(node)

        def visit_ImportFrom(self, node):
            if node.module in UNSAFE_MODULES:
                unsafe_calls.append((node.module, node.lineno))
            self.generic_visit(node)

        def visit_Call(self, node):
            if isinstance(node.func, ast.Attribute):
                module_name = node.func.value.id if isinstance(node.func.value, ast.Name) else None
                func_name = node.func.attr
                if module_name in UNSAFE_MODULES and (func_name in UNSAFE_MODULES[module_name] or not UNSAFE_MODULES[module_name]):
                    unsafe_calls.append((f"{module_name}.{func_name}", node.lineno))
            self.generic_visit(node)

    visitor = UnsafeCallVisitor()
    visitor.visit(tree)

    return unsafe_calls

In [None]:
import os

# List of potentially unsafe modules and their functions
UNSAFE_MODULES = {
    'os': ['system', 'popen', 'spawn', 'execv', 'execl'],
    'sys': ['exit', 'setrecursionlimit', 'settrace'],
    'subprocess': ['call', 'Popen', 'run'],
    'shutil': ['rmtree'],
    'socket': [],
    'requests': [],
    'ftplib': [],
    'pickle': ['load', 'loads', 'Unpickler']
}


def analyze_directory(directory):
    unsafe_operations = {}
    for root, _, files in os.walk(directory):
        for file in files:
            if file.endswith('.py'):
                filepath = os.path.join(root, file)
                unsafe_calls = analyze_file(filepath)
                if unsafe_calls:
                    unsafe_operations[filepath] = unsafe_calls
    return unsafe_operations

def generate_report(unsafe_operations):
    report = []
    for filepath, operations in unsafe_operations.items():
        report.append(f"File: {filepath}")
        for op in operations:
            report.append(f"  Line {op[1]}: {op[0]}")
    return "\n".join(report)

if __name__ == "__main__":
    directory_to_analyze = '/usr/local/lib/python3.10/dist-packages/pandas'
    unsafe_operations = analyze_directory(directory_to_analyze)
    report = generate_report(unsafe_operations)
    print(report)

File: /usr/local/lib/python3.10/dist-packages/pandas/_typing.py
  Line 8: os
File: /usr/local/lib/python3.10/dist-packages/pandas/conftest.py
  Line 32: os
  Line 8: sys
File: /usr/local/lib/python3.10/dist-packages/pandas/_testing/contexts.py
  Line 4: os
File: /usr/local/lib/python3.10/dist-packages/pandas/_testing/_io.py
  Line 7: socket
File: /usr/local/lib/python3.10/dist-packages/pandas/_testing/__init__.py
  Line 7: os
  Line 10: sys
File: /usr/local/lib/python3.10/dist-packages/pandas/io/pickle.py
  Line 4: pickle
  Line 196: pickle.load
File: /usr/local/lib/python3.10/dist-packages/pandas/io/stata.py
  Line 17: os
  Line 19: sys
File: /usr/local/lib/python3.10/dist-packages/pandas/io/pytables.py
  Line 14: os
File: /usr/local/lib/python3.10/dist-packages/pandas/io/parquet.py
  Line 5: os
File: /usr/local/lib/python3.10/dist-packages/pandas/io/common.py
  Line 22: os
File: /usr/local/lib/python3.10/dist-packages/pandas/io/formats/printing.py
  Line 6: sys
File: /usr/local/lib/p
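
One limitation of the visitor above is that it only matches calls written as `module.func(...)` where the module is bound under its own name. Calls made through an alias (`import subprocess as sp`) or a direct import (`from os import system`) are not reported as calls. The following is a rough sketch of an alias-aware variant; the class name `AliasAwareVisitor`, the helper `find_unsafe_calls`, and the reduced `UNSAFE` table are illustrative assumptions.

```python
import ast

# Reduced table for illustration; the full UNSAFE_MODULES mapping could be used instead
UNSAFE = {
    "os": {"system", "popen"},
    "subprocess": {"call", "Popen", "run"},
}


class AliasAwareVisitor(ast.NodeVisitor):
    def __init__(self):
        self.module_aliases = {}  # local name -> real module name
        self.func_aliases = {}    # local name -> "module.func"
        self.hits = []

    def visit_Import(self, node):
        for alias in node.names:
            if alias.name in UNSAFE:
                self.module_aliases[alias.asname or alias.name] = alias.name
        self.generic_visit(node)

    def visit_ImportFrom(self, node):
        if node.module in UNSAFE:
            for alias in node.names:
                if alias.name in UNSAFE[node.module]:
                    local = alias.asname or alias.name
                    self.func_aliases[local] = f"{node.module}.{alias.name}"
        self.generic_visit(node)

    def visit_Call(self, node):
        func = node.func
        if isinstance(func, ast.Attribute) and isinstance(func.value, ast.Name):
            module = self.module_aliases.get(func.value.id)
            if module and func.attr in UNSAFE[module]:
                self.hits.append((f"{module}.{func.attr}", node.lineno))
        elif isinstance(func, ast.Name) and func.id in self.func_aliases:
            self.hits.append((self.func_aliases[func.id], node.lineno))
        self.generic_visit(node)


def find_unsafe_calls(source):
    """Return (qualified_name, line) pairs for unsafe calls in a source string."""
    visitor = AliasAwareVisitor()
    visitor.visit(ast.parse(source))
    return visitor.hits
```

This still resolves names purely syntactically: reassignment (`run = subprocess.run`) and dynamic lookups such as `getattr(os, "system")` remain invisible to it.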

Next, we analyze the filtered dependency list for unsafe operations.

In [5]:
import os
import ast
import sys
import pkg_resources

# Increase the recursion limit
sys.setrecursionlimit(3000)

# List of potentially unsafe modules and their functions
UNSAFE_MODULES = {
    'os': ['system', 'popen', 'spawn', 'execv', 'execl'],
    'sys': ['exit', 'setrecursionlimit', 'settrace'],
    'subprocess': ['call', 'Popen', 'run'],
    'shutil': ['rmtree'],
    'socket': [
        'socket', 'create_connection', 'getaddrinfo', 'gethostbyname',
        'gethostbyname_ex', 'gethostbyaddr', 'create_server', 'fromfd'
    ],
    'requests': [
        'get', 'post', 'put', 'delete', 'head', 'options', 'patch', 'Session'
    ],
    'ftplib': [],
    'pickle': ['load', 'loads', 'Unpickler']
}

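def analyze_file(filepath):
    # Try several encodings; 'latin-1' accepts any byte sequence, so the loop
    # only falls through entirely when the file cannot be opened.
    encodings = ['utf-8', 'latin-1', 'iso-8859-1']
    tree = None
    for encoding in encodings:
        try:
            with open(filepath, 'r', encoding=encoding) as file:
                tree = ast.parse(file.read(), filename=filepath)
            break
        except (UnicodeDecodeError, OSError):
            continue
        except (SyntaxError, ValueError):
            # Not parseable as Python source (ValueError covers null bytes)
            return []
    if tree is None:
        # The file could not be read, so there is nothing to analyze
        return []
    unsafe_calls = []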

    class UnsafeCallVisitor(ast.NodeVisitor):
        def visit_Import(self, node):
            for alias in node.names:
                if alias.name in UNSAFE_MODULES:
                    unsafe_calls.append((alias.name, node.lineno))
            self.generic_visit(node)

        def visit_ImportFrom(self, node):
            if node.module in UNSAFE_MODULES:
                unsafe_calls.append((node.module, node.lineno))
            self.generic_visit(node)

        def visit_Call(self, node):
            if isinstance(node.func, ast.Attribute):
                module_name = node.func.value.id if isinstance(node.func.value, ast.Name) else None
                func_name = node.func.attr
                if module_name in UNSAFE_MODULES and (func_name in UNSAFE_MODULES[module_name] or not UNSAFE_MODULES[module_name]):
                    unsafe_calls.append((f"{module_name}.{func_name}", node.lineno))
            self.generic_visit(node)

    visitor = UnsafeCallVisitor()
    visitor.visit(tree)

    return unsafe_calls

def analyze_directory(directory):
    unsafe_operations = {}
    for root, _, files in os.walk(directory):
        for file in files:
            if file.endswith('.py'):
                filepath = os.path.join(root, file)
                unsafe_calls = analyze_file(filepath)
                if unsafe_calls:
                    unsafe_operations[filepath] = unsafe_calls
    return unsafe_operations

def generate_report(unsafe_operations):
    report = []
    for filepath, operations in unsafe_operations.items():
        report.append(f"File: {filepath}")
        for op in operations:
            report.append(f"  Line {op[1]}: {op[0]}")
    return "\n".join(report)

def get_dependency_directories():
    dependency_dirs = []
    for dist in pkg_resources.working_set:
        if dist.project_name in DEPENDENCIES:
            try:
                dependency_dirs.append(dist.location)
            except KeyError:
                continue
    return dependency_dirs

if __name__ == "__main__":
    DEPENDENCIES = [
        'numpy', 'pymysql','bs4', 'fsspec',
        'sqlalchemy'
    ]

    unsafe_operations = {}
    for directory in get_dependency_directories():
        module_operations = analyze_directory(directory)
        unsafe_operations.update(module_operations)

    report = generate_report(unsafe_operations)
    print(report)


File: /usr/local/lib/python3.10/dist-packages/socks.py
  Line 10: os
  Line 11: os
  Line 12: socket
  Line 14: sys
  Line 180: socket.getaddrinfo
  Line 573: socket.getaddrinfo
  Line 622: socket.gethostbyname
  Line 671: socket.gethostbyname
  Line 750: socket.gethostbyname
File: /usr/local/lib/python3.10/dist-packages/portpicker.py
  Line 41: os
  Line 43: socket
  Line 44: sys
  Line 113: socket.socket
  Line 236: socket.socket
  Line 239: socket.socket
  Line 331: sys.exit
File: /usr/local/lib/python3.10/dist-packages/nest_asyncio.py
  Line 5: os
  Line 6: sys
File: /usr/local/lib/python3.10/dist-packages/entrypoints.py
  Line 12: sys
File: /usr/local/lib/python3.10/dist-packages/pwiz.py
  Line 4: os
  Line 5: sys
  Line 47: sys.exit
  Line 208: sys.exit
File: /usr/local/lib/python3.10/dist-packages/ipykernel_launcher.py
  Line 7: sys
File: /usr/local/lib/python3.10/dist-packages/appdirs.py
  Line 20: sys
  Line 21: os
File: /usr/local/lib/python3.10/dist-packages/pydot.py
  Line 
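
The report above lists files such as `socks.py` and `portpicker.py` that do not belong to the selected dependencies. A likely explanation is that `dist.location` in `pkg_resources` is the directory containing the installed distribution (here `dist-packages`), not the package's own directory, so the walk covers everything installed there. Distribution names can also differ from import names (for example, `bs4` is the import name of the `beautifulsoup4` distribution). A minimal sketch of resolving a package directory from its import name instead, where the helper `package_directory` is hypothetical:

```python
import importlib.util
import os


def package_directory(import_name):
    """Return the directory of an importable package, or None if it is not a package."""
    try:
        spec = importlib.util.find_spec(import_name)
    except (ImportError, ValueError):
        return None
    # Plain modules (single .py files) have no submodule search locations
    if spec is None or not spec.submodule_search_locations:
        return None
    return os.path.abspath(list(spec.submodule_search_locations)[0])


if __name__ == "__main__":
    for name in ["numpy", "pymysql", "bs4", "fsspec", "sqlalchemy"]:
        print(name, "->", package_directory(name))
```

Passing the directories returned here to `analyze_directory` would restrict the scan to the selected packages; packages that are not installed simply yield `None`.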

**Method 2:  bandit**
Bandit is a static analysis tool for Python that parses each file into an AST and runs a set of security checks against it. It does not produce a call graph by itself, but the issues it reports identify the call sites from which a call graph for *pandas* and its filtered dependencies can be traced. For simplicity, we look only at findings with ***high*** severity.

Please install bandit using:

```
pip install bandit
```

In [6]:
!bandit -r '/usr/local/lib/python3.10/dist-packages/pandas' --severity high
!bandit -r '/usr/local/lib/python3.10/dist-packages/sqlalchemy' --severity high
!bandit -r '/usr/local/lib/python3.10/dist-packages/requests' --severity high
!bandit -r '/usr/local/lib/python3.10/dist-packages/fsspec' --severity high
!bandit -r '/usr/local/lib/python3.10/dist-packages/bs4' --severity high
!bandit -r '/usr/local/lib/python3.10/dist-packages/pymysql' --severity high

[main]	INFO	profile include tests: None
[main]	INFO	profile exclude tests: None
[main]	INFO	cli include tests: None
[main]	INFO	cli exclude tests: None
[main]	INFO	running on Python 3.10.12
[main]	INFO	profile include tests: None
[main]	INFO	profile exclude tests: None
[main]	INFO	cli include tests: None
[main]	INFO	cli exclude tests: None
[main]	INFO	running on Python 3.10.12
Run started:2024-05-19 14:14:27.860749

Test results:
>> Issue: [B324:hashlib] Use of weak MD5 hash for security. Consider usedforsecurity=False
   Severity: High   Confidence: High
   CWE: CWE-327 (https://cwe.mitre.org/data/definitions/327.html)
   More Info: https://bandit.readthedocs.io/en/1.7.8/plugins/b324_hashlib.html
   Location: /usr/local/lib/python3.10/dist-pack
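
Console output like the above is hard to post-process. Bandit can also emit JSON (`-f json`, optionally with `-o <file>`), which makes it easier to collect findings per file. The sketch below filters such a report; the key names (`results`, `issue_severity`, `test_id`, `filename`, `line_number`) are assumed from Bandit's JSON formatter and should be checked against the installed version, and the sample data is made up for illustration.

```python
import json


def high_severity_findings(report):
    """Return (filename, line, test_id) for every HIGH severity result."""
    return [
        (item["filename"], item["line_number"], item["test_id"])
        for item in report.get("results", [])
        if item.get("issue_severity", "").upper() == "HIGH"
    ]


# Made-up report fragment, for illustration only
SAMPLE_REPORT = json.loads("""
{
  "results": [
    {"filename": "pkg/a.py", "line_number": 10,
     "test_id": "B324", "issue_severity": "HIGH"},
    {"filename": "pkg/b.py", "line_number": 3,
     "test_id": "B101", "issue_severity": "LOW"}
  ]
}
""")

if __name__ == "__main__":
    for filename, line, test_id in high_severity_findings(SAMPLE_REPORT):
        print(f"{filename}:{line} {test_id}")
```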

##### Conclusion:

The AST-based module analysis, together with the Bandit scan, identifies the unsafe operations, which can then be replaced or pruned in order of priority.
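
The problem statement also asks which high-level functions can lead to the unsafe operations. A rough sketch of how that could be derived with the same `ast` machinery is shown here, assuming Python 3.9+ for `ast.unparse`. The helpers `build_call_graph` and `callers_reaching` are hypothetical, and callee names are resolved only syntactically within a single source string.

```python
import ast
from collections import defaultdict


def build_call_graph(source):
    """Map each function name to the textual names of the calls in its body."""
    graph = defaultdict(set)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Calls inside nested functions are also attributed to the outer one
            for child in ast.walk(node):
                if isinstance(child, ast.Call):
                    graph[node.name].add(ast.unparse(child.func))
    return dict(graph)


def callers_reaching(graph, targets):
    """Return every function that can reach one of the targets, directly or not."""
    reached = set()
    changed = True
    while changed:
        changed = False
        for func, callees in graph.items():
            if func not in reached and callees & (set(targets) | reached):
                reached.add(func)
                changed = True
    return reached


EXAMPLE = '''
import os

def low():
    os.system("ls")

def mid():
    low()

def high():
    mid()

def safe():
    print("ok")
'''

if __name__ == "__main__":
    graph = build_call_graph(EXAMPLE)
    print(sorted(callers_reaching(graph, {"os.system"})))
```

A real analysis would also need to resolve methods, imports across modules, and aliases, so this should be read as a starting point rather than a complete call graph.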

### Task 3: **Pruning unsafe features**

Such methods can be pruned by monkey-patching them. For example, consider the following code, which reads and writes a CSV file.


```
import pandas as pd

# Example usage
data = pd.read_csv('/tmp/unsafe_data.csv')
data.to_csv('/tmp/unsafe_data_copy.csv')
```

This code reads from and writes to arbitrary file paths, which may be harmful.
To fix this without affecting functionality, we can define a SAFE_DIRECTORY in which read and write operations are allowed. The patched functions work only for paths inside that directory and raise an error otherwise.

Code:

```
import os

import pandas as pd

# Define safe directory
SAFE_DIRECTORY = '/safe/directory/'

# Original functions
_original_read_csv = pd.read_csv
_original_to_csv = pd.DataFrame.to_csv

def _ensure_safe_path(file_path):
    """Raise PermissionError unless file_path resolves inside SAFE_DIRECTORY."""
    # Reject buffers, None, and URL-style strings: only local paths are allowed
    if not isinstance(file_path, (str, os.PathLike)) or '://' in str(file_path):
        raise PermissionError("Unsafe file operation detected")
    # realpath collapses '..' segments and symlinks, which a plain
    # startswith() prefix check would let through
    safe_root = os.path.realpath(SAFE_DIRECTORY)
    resolved = os.path.realpath(file_path)
    if os.path.commonpath([safe_root, resolved]) != safe_root:
        raise PermissionError("Unsafe file operation detected")

# Safe read_csv function
def safe_read_csv(file_path, *args, **kwargs):
    _ensure_safe_path(file_path)
    return _original_read_csv(file_path, *args, **kwargs)

# Safe to_csv function
def safe_to_csv(self, file_path, *args, **kwargs):
    _ensure_safe_path(file_path)
    return _original_to_csv(self, file_path, *args, **kwargs)

# Apply monkey patch
pd.read_csv = safe_read_csv
pd.DataFrame.to_csv = safe_to_csv

# Example of Pandas code using the monkey-patched functions
def read_data(file_path):
    return pd.read_csv(file_path)

def write_data(data, file_path):
    data.to_csv(file_path)

# Example usage
try:
    data = read_data('/safe/directory/unsafe_data.csv')
    write_data(data, '/safe/directory/unsafe_data_copy.csv')
except PermissionError as e:
    print(e)
```
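
The same monkey-patching idea can be applied to the low-level system operations identified in Task 2. The sketch below temporarily replaces a few of them with functions that raise `PermissionError` and restores the originals afterwards. The names `BLOCKED` and `blocked_system_calls` are hypothetical, and this is not a sandbox: code that imported a function directly before the patch was applied (`from os import system`) keeps its own reference to the original.

```python
import contextlib
import os
import subprocess

# (module, attribute) pairs to disable; extend as the analysis requires
BLOCKED = [
    (os, "system"),
    (os, "popen"),
    (subprocess, "Popen"),
    (subprocess, "run"),
    (subprocess, "call"),
]


def _deny(qualified_name):
    """Build a replacement that refuses to run."""
    def denied(*args, **kwargs):
        raise PermissionError(f"{qualified_name} is disabled in this environment")
    return denied


@contextlib.contextmanager
def blocked_system_calls(targets=BLOCKED):
    """Disable the given functions inside the with-block, then restore them."""
    originals = [(module, attr, getattr(module, attr)) for module, attr in targets]
    try:
        for module, attr, _ in originals:
            setattr(module, attr, _deny(f"{module.__name__}.{attr}"))
        yield
    finally:
        for module, attr, original in originals:
            setattr(module, attr, original)


if __name__ == "__main__":
    with blocked_system_calls():
        try:
            os.system("echo should not run")
        except PermissionError as exc:
            print(exc)
```

Wrapping untrusted pandas calls in `with blocked_system_calls():` keeps the patch scoped, so the rest of the application continues to see the original functions.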

